Performance optimization and tuning for Enterprise PowerVM/AIX, including POWER8
Björn Rodén works for IBM Systems Lab Services and is a member of the IBM Systems Lab Services Executive Advisory Practice and the WW PowerCare Teams for Availability, Performance, Security and Cloud. Björn holds MSc, BSc and DiplSSc in Informatics and BCSc and DiplCSc in Computer Science, is an IBM Redbooks Platinum Author and IBM Thought Leader Certified Specialist, and has worked in different roles architecting, designing, planning, leading, implementing, programming, and assessing high availability, resilient, secure, and high performance systems and solutions since 1990. © Copyright IBM Corporation 2015. Technical University/Symposia materials may not be reproduced in whole or in part without the prior written permission of IBM.
Thanks to: Pete H, Dirk, Puvi, Sivakumar K, Brian H, DRS, Shrini, Vasu, Tim L, Bruce M, Sangeeth K, Hema Bt, Mala A, Herman D, Kiet L, Steve N, Kurt K, Anandakumar M, Niranjan S, Nigel G, Michael AM, Alan W, et al
Session Objectives
§ This session focuses on performance optimization and baseline tuning for high workload on Enterprise Power Systems, to maximize business value
– We will look into high impact areas for consideration
– We will focus on high yield tuning and baseline tuning for high workload utilization
– Focused on Enterprise and High End Power Systems 795 and 780/770
– Including first notes on POWER8 E870 and E880 (with 4 CECs from 2H2015)
§ Objective: You will learn how to baseline Enterprise Power Systems/AIX for high workload. This session is based on practical experience from WW customer issues and resolutions, primarily on Power 795 but also some 780/770, between 2011 and 2015.
Direct client experiences with input to this session 2011-2014
1. Ecobank/Ghana 2. Commercial Bank of Ethiopia/Ethiopia 3. Construction and Business Bank (CBB)/Ethiopia 4. Awash International Bank/Ethiopia 5. Commercial Bank of Egypt (CBE)/Egypt 6. Commercial International Bank (CIB)/Egypt 7. Banque Misr/Egypt 8. Standard Bank of South Africa (SBSA)/South Africa 9. Eskom/South Africa 10. MTN/South Africa 11. Vodacom/South Africa 12. Edcon/South Africa 13. First Bank of Nigeria/Nigeria 14. Central Bank of Nigeria/Nigeria 15. UBA/Nigeria 16. Fidelity Bank/Nigeria 17. Co-operative Bank of Kenya/Kenya 18. Stanbic/Kenya 19. Equity Bank/Kenya 20. Stanbic/Uganda 21. Stanbic/Botswana 22. Meditel/Morocco 23. INWI/Morocco 24. African Development Bank (AFDB)/Tunisia 25. Tunisiana/Tunisia 26. Central Bank of Libya/Libya 27. Union Bank/Jordan 28. Meezan Bank/Pakistan 29. Pakistan Telecom Mobile Ltd (PTML)/Pakistan 30. Sberbank/Russia 31. Unilever/United Kingdom 32. Finanz Informatik/Germany 33. REWE/Germany 34. TUI InfoTec/Germany 35. BG Phoenics/Germany 36. ZF/Germany 37. Ukraine Railways/Ukraine 38. TMB Bank/Thailand 39. Polska Telefonia Cyfrowa/Poland 40. Zavarovalnica Triglav/Slovenia 41. NNIT/Denmark 42. Axfood/Sweden 43. Turkiye Is Bankasi/Turkey 44. TEB/Turkey 45. Turk Telecom/Turkey 46. Turkcell/Turkey 47. Finansbank/Turkey 48. Landmark Group/UAE 49. First Gulf Bank/UAE 50. Emirates Airlines/UAE 51. Etisalat/UAE 52. ADCB/UAE 53. Saudi Aramco/Saudi Arabia 54. Riyadh Bank/Saudi Arabia 55. Bank Saudi Fransi/Saudi Arabia 56. Al Rajhi Bank/Saudi Arabia 57. Al Inma/Saudi Arabia 58. Byblos Bank/Lebanon
Core banking, telecom, etc. Multiple systems with >80 partitions over >200 cores and >800 virtual processors; multiple 256-core single partitions; 1350 virtual processors over >200 cores.
Agenda
§ Known high impact areas for consideration
– CPU configuration & utilization
– Partition memory affinity & utilization
– Network I/O
– Storage I/O
– AIX tunables
– Adapter placement
– Firmware and system software
– Extras
PowerVM virtual devices and AIX have the same core tunables and default attribute values for a partition image, whether it is tuned for 2 cores or 100+, on a 795 or 880.
ADAPT !
High End aka Critical aka “BIG IRON”
Throughput
Baseline
§ A baseline is a starting point
– To baseline a work product may require certain changes to the work product to ensure its conformance to the characteristics associated with the referenced baseline.
– Based on an initial set of critical observations or data used for comparison, or a control against known requirements.
§ Define purpose/priority
– For total server throughput
– For specific business critical workload performance
§ Metrics
– Response time (simple and complex transactions)
– Throughput load (sustained/peak transaction mix over periods)
– Maximum user load (sustained/peak simultaneous users over periods)
– Business related metrics (sustained orders per hour, growth capability, …)
The key objective of virtualization is to reduce cost and optimize total server throughput.
Infrastructure Perspectives / System Levels – Constraints, Contentions, Depletions
(Diagram: Enterprise environment > Site environment > Data Centre environment (UPS/Gen., MAN/WAN) > Server and Storage > Local Area Network / Storage Area Network > Application > Middleware > Operating System & System Software > Logical/Virtual Machine (kernel stack) > Physical Machine (network, storage, hypervisor) > Hardware (cores, cache, nest))
• Business requirements
– Business impact
• People
– Knowledge and skill
• Processes
– System management
• Technology (this page)
– Architecture and technology
– Lifecycle: DESIGN – BUILD – OPERATE – REPLACE
Resource constraints, contentions or depletions
§ Focus areas
– Total server throughput (managed system)
– Individual partitions throughput/responsiveness
– Virtualization stack
– Micro-partitioning
– Advanced PowerVM features
§ Compromises/trade-offs for balance
– Performance
– Availability
– Security
– Manageability
§ Resources
– Network I/O
– Storage I/O
– CPU
– Memory
– Virtualization stack
– Hypervisor
(Diagram: constraints, contentions and depletions across dedicated resources and VIOS-controlled resources – managed system (host) and hypervisor, partition (virtual server), AIX, RSCT/CA, SAN/LAN, MPIO/VPN, shared/dedicated virtual and physical adapters and ports, switches, routing, end points)
POWER7 High End comparison to POWER8
| | POWER7 795 (9119-FHB) | POWER8 E880 (9119-MHE) |
| CPU sockets per node | 4 | 4 |
| Max processor nodes (system units) | 8 | 2 |
| Max cores | 256 | 64 |
| Frequency | 4.0 GHz | 4.4 GHz |
| Inter-CEC SMP bus (A bus) | 336 GB/s per node | 307 GB/s per CEC drawer |
| Intra-CEC SMP bus (X bus) | 576 GB/s per node | 922 GB/s per CEC drawer |
| Max memory | 2 TB per node (64 GB DIMM) | 4 TB per CEC drawer (128 GB CDIMM) |
| Memory per core | 64 GB | 128 GB |
| Memory bandwidth (peak) | 546 GB/s per node | 922 GB/s per CEC drawer |
| Memory bandwidth per core (peak) | 17 GB/s | 28.6 GB/s |
| I/O bandwidth (peak) | 80 GB/s per node (GX) | 256 GB/s per CEC drawer (PCIe Gen3) |
| Max I/O drawers | 32 (IB attached) | 4 (PCIe Gen3 attached) |
| Total disk drives | Up to 4032 | Up to 1536 |
| Max PCIe I/O slots | 640 in I/O drawers only | 48 in I/O drawers + 8 internal slots |
http://194.196.36.29/webapp/set2/sas/f/best/power7_performance_best_practices_v7.pdf
https://www-304.ibm.com/webapp/set2/sas/f/best/power8_performance_best_practices.pdf
CPU Utilization
Processor concept overview
POWER7 Virtualization - Best Practice Guide - Version 3.0
§ The best practice for LPAR entitlement is to match the entitled capacity (EC) to average utilization and let peaks be addressed by additional uncapped capacity (VP).
– The rule of thumb is to set entitlement close to average utilization for each LPAR in a system; however, where an LPAR has to be given higher priority compared to other LPARs in the system, this rule can be relaxed.
– For example, if production and non-production workloads are consolidated on the same system, production LPARs would be preferred to have higher priority over non-production LPARs.
– In that case, in addition to setting higher weights for uncapped capacity, the entitlement of the production LPARs can be raised while reducing the entitlement of non-production LPARs.
– This allows these important production LPARs to have better partition placement (affinity), and these LPARs will have additional entitled capacity so as not to rely solely on uncapped processing.
– At the same time, if a production SPLPAR is not using its entitled capacity, that capacity can be used by non-production SPLPARs, and the non-production SPLPAR will be pre-empted if the production SPLPAR needs its capacity back.
– https://www.ibm.com/developerworks/community/wikis/form/anonymous/api/wiki/61ad9cf2-c6a3-4d2c-b779-61ff0266d32a/page/64c8d6ed-6421-47b5-a1a7-d798e53e7d9a/attachment/f9ddb657-2662-41d4-8fd8-77064cc92e48/media/p7_virtualization_bestpractice.doc
– ftp://ftp.software.ibm.com/systems/power/community/wikifiles/P7_virtualization_bestpractice.doc
Note on Low Latency VIOS: For optimal performance with low I/O latencies, the VIOS entitlement should be rounded up to a whole number of cores. For high workload, consider dedicated processor mode (donating), with placement within one affinity domain.
https://www-912.ibm.com/wle/EstimatorServlet?view=help&topic=virtualIOHelp
Calculating for POWER Virtualization Best Practice
§ Calculating fit using:
– Calculated VP based on the 99.73rd percentile as peak
– Calculated EC based on the 90th and 95th percentiles as average
– Understand the workload and adjust!
§ Core:VP relationship (server & LPAR)
– For POWER7 with 730 firmware
– System firmware 760 and 780 have significant improvements
§ Shared processor partitions
– Home cores are assigned based on Entitled Capacity (online desired)
– Multiple partitions can share the same home core
– Only cores with memory available for the partition are eligible to be home cores
– A virtual processor (VP) can be dispatched anywhere in the system if there is contention for its home core at the time the VP is ready to run
– Pre-empting during the dispatch cycle allows redispatch of the VP to its home core (post-730 firmware)
Manual adjustment based on knowledge of the workload. Work with new EC, new VP and MSPP max values:
1. Is the sampled workload utilization representative?
2. Should growth be included, and for how long a planning period?
3. Should additional headroom be included, such as for cluster collocation or due to seasonal workload variations?
4. Should the System CPU Utilization Target Threshold include growth, collocation and/or additional headroom?
5. Should critical partitions have higher EC than average?
6. Should non-critical partitions have lower EC than average?
7. Where in the system life cycle are the partitions? (Implementation; Operation & Maintenance; Displace & Decommission)
§ Monitor using PowerVP – spot check only, NOT continuously. A percentile calculation sketch follows below.
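A minimal sketch of the percentile calculation described above, assuming a hypothetical file cpu_samples.txt with one physical-consumed sample per line (for example exported from nmon or lpar2rrd); the 90th/95th percentiles suggest EC candidates and the 99.73rd percentile suggests the VP count, per the rule of thumb on this page.

  # Sort the samples, then pick the 90th/95th/99.73rd percentile values
  sort -n cpu_samples.txt | awk '
    { v[NR] = $1 }
    END {
      p90   = v[int(NR * 0.90)];    # candidate EC (lower bound)
      p95   = v[int(NR * 0.95)];    # candidate EC (higher bound)
      p9973 = v[int(NR * 0.9973)];  # candidate VP (peak), round up to whole VPs
      printf "EC candidate (90th): %.2f\nEC candidate (95th): %.2f\nVP candidate (99.73rd, rounded up): %d\n",
             p90, p95, int(p9973) + (p9973 > int(p9973))
    }'

The sample file name and export mechanism are assumptions; adjust the percentile choices to the workload as discussed above.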
Right sizing optimization – sample sizing model (1/3)
§ Understand actual workload and business requirements, over time, not ad hoc
§ Adjust the model below to fit actual workload, near-term growth and workload collocation
– Favor critical production partitions, with entitlement, weight value and hardware-thread raw mode
– Differentiate between critical partitions (optimum partition throughput and lower latency) and non-critical partitions (optimum server throughput)
– Non-vital production and User Acceptance Test (UAT) partitions get less favored entitlement, and use more core hardware threads first (vpm_throughput_mode and vpm_throughput_core_threshold)
– Development partitions get less entitlement, enough to support virtual FC, and use all hardware threads first
§ Target:
– Reduce uncapped processor capacity needs and ensure guaranteed capacity is available for business critical workloads
– Reduce the EC:VP ratio to keep total physical core:virtual processor below 1:2, to reduce the chance of VP dispatch and far memory access in uncapped shared processor mode
– Reduce the impact of AIX unfolding and SMT from UAT partitions on hypervisor VP scheduling
Table 1: Sample sizing model with uncapped shared processor VIOS
| Business Criticality | Weight | EC (average) | VP (peak) | vpm_throughput_mode | vpm_throughput_core_threshold | lpar_placement | Affinity Group |
| VIOS | 255 | 95th pctl | Pair x 99.73% | 1 | 0 | 1 | - |
| Critical Production | 100 | 95th pctl | 99.73% | 0 | - | 1 (EC<=16/24/32) | 200…100 |
| Production/UAT | 10 | 90th pctl | 99.73% | 0-2-4 | EC/average | - | - |
| Not vital | 1 | 90th pctl | 99.73% | 4 | EC/average | - | - |
Note: Factor in growth, additional application workload in partitions, collocation of multi-partition workload/clustering, and account for the sampling period not representing expected annual workload. A profile-update sketch follows below.
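As a sketch of applying the model from the HMC command line, chsyscfg can update a partition profile's entitlement, virtual processors and uncapped weight; the managed system, partition and profile names and the numeric values below are placeholders, and the profile's min/max bounds must already accommodate the new desired values.

  # Critical production partition: EC from the 95th percentile, VP from the rounded-up 99.73rd percentile, weight 100
  chsyscfg -r prof -m MANAGEDSYSTEM -i "name=PROFILENAME,lpar_name=LPARNAME,desired_proc_units=6.0,desired_procs=12,sharing_mode=uncap,uncap_weight=100"

Remember that profile changes only take effect on the next activation (or DLPAR to the same values); compare against the currently running configuration before and after.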
Right sizing optimization – sample sizing model (2/3)
§ Manual adjustment based on knowledge of the workload
– Partitions with weight 100 and above
• If sampled utilization is below current EC, use current EC; if above, use the rounded new EC
• Use the 99.73rd percentile for VP to fit peak sampled utilization; the exception is the VIOS, for assumed collocated workload
– Spot check the utilization plot over the sampled period:
• If the calculated EC:VP ratio is >1:2 and VP >2
• If sampled average and peak utilization are similar
• If sampled average utilization is significantly less than current EC
• If sampled average utilization is significantly higher than current EC
• If sampled peak utilization is significantly less than current VP
• If sampled peak utilization is near current VP
§ Work with NEW EC and NEW VP
– Is the sampled workload representative?
– Should growth be included?
– Should additional headroom be included, such as for cluster collocation?
– Should the System CPU Utilization Target Threshold include growth, collocation and additional headroom?
– Should critical partitions have higher EC than average?
– Should non-critical partitions have lower EC than average?
– Where in the system life cycle are the partitions?
• Implementation
• Operation & Maintenance
• Disposition & Decommission
Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
17
Updated
Right sizing optimization – sample sizing model (3/3)
§ Using dedicated virtual processors
– Dedicated processors with keep_idle_procs (keep unused capacity) usually provide the best performance; however, unused processing capacity cannot be used by the shared processor pool as it can with share_idle_procs_always, and the upper limit of virtual processors is capped – monitor dedicated processor partitions for average and peak utilization.
§ Using shared virtual processor in uncapped mode – With shared virtual processors monitor utilization and set Entitled Capacity (EC) to average utilization and the number of Virtual Processors (VP) to peak utilization – create a workload specific sizing model for average calculation.
§ When using shared processor partitions in uncapped mode, ensure spread of weight values – Spread weight value with at least 50% gaps, set 255 for VIOS, and consider: 120 for core vital, 60 for non-core vital, 30 for core non-vital and 1 for non-core non-vital (core systems support primary business functions), or even more progressive 255-100-10-1.
§ Ensure sufficient desired memory, 4-8 GB for VIOS, if using Shared Ethernet Adapter (SEA)
– Perform a frame IPL before live production inception; start the large business critical partitions to SMS mode first (if they use VIOS resources), then the VIOS, followed by smaller and non business critical partitions.
– Follow the partition placement guidelines to ensure large memory footprint partitions are allocated first.
– Spot check memory placement after DLPAR operations with lssrad -av and/or an HMC resource dump; if fragmentation with performance impact occurs after a DLPAR operation, schedule a service window for defragmentation.
§ If the availability priority (default 127) is the same for all non-VIOS partitions (used in case of CPU failure)
– Prioritize vital production partitions (a 1-step difference is sufficient); in case of a physical processor failure the POWER Hypervisor uses the availability priority to determine which partitions to stop, if needed, to free up sufficient physical processors. If multiple partitions have the same value, the sub-priority is decided by the hypervisor.
– 191 is the default for VIOS and 127 is the default for AIX partitions.
§ Max memory setting vis-à-vis minimum and desired
– Ordinarily keep maximum memory at 1.25-1.50x of desired memory
– Consider that maximum memory is used to size the POWER Hypervisor maintained Hardware Page Table (HPT).
An availability-priority check sketch follows below.
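A small sketch for reviewing and adjusting the availability priority mentioned above from the HMC command line; lpar_avail_priority is assumed to be the relevant attribute on this HMC level, and the system and partition names are placeholders.

  # List current availability priorities for all partitions on the managed system
  lssyscfg -r lpar -m MANAGEDSYSTEM -F name,lpar_avail_priority
  # Raise a vital production partition one step above the AIX default of 127
  chsyscfg -r lpar -m MANAGEDSYSTEM -i "name=LPARNAME,lpar_avail_priority=128"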
2 perspectives for shared uncapped processor partitions
§ Processing units
– Usage time for each and all virtual processors of a partition
– Capped by setting, virtual processors (uncapped) or shared processor pool
– Entitled capacity (desired processing units)
– Weight value
– Memory affinity by configuration
§ Virtual processors
– Ratio of processing units vs virtual processors
– Ready to run and dispatched on a core
– Unfolding & folding
– vpm_throughput_mode (vpm_throughput_core_threshold)
– Memory affinity when dispatched
CPU utilization perspectives:
• Utilization of cores when in use
• Processing units consumed
• Available cores for use
• In-use cores vs virtual processors ready to run
A folding-tunable sketch follows below.
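A hedged sketch of inspecting and changing the AIX VP folding behaviour referenced above through the schedo tunables; the threshold value of 2 is illustrative only, and these are restricted tunables on recent AIX levels, so AIX warns and may ask for confirmation before applying changes.

  # Show the current folding/throughput-mode tunables
  schedo -o vpm_throughput_mode -o vpm_throughput_core_threshold
  # Example: scaled throughput mode, using two SMT threads per core before unfolding
  # another VP, with a core threshold of 2 (illustrative values; add -p to persist)
  schedo -o vpm_throughput_mode=2 -o vpm_throughput_core_threshold=2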
What is an appropriate “average” and “peak” for sizing YOUR workload
§ 2-3 daily summaries every day for the whole period
– 0600-1800/1800-0600, or 0800-1600/1600-2400/0000-0800, or based on actual workload
§ 2 weekly summaries
– 5 days Monday-Friday
– 7 days Monday-Sunday
§ One period summary should be more than 2 weeks in length, for the whole duration with the same configuration settings (EC, VP, real memory)
– For each summary point, calculate CPU utilization:
• min
• max
• arithmetic mean
• median (50th percentile)
• 1st and 3rd quartile (25th and 75th percentile)
• 90th percentile
• 95th percentile
• 1-sigma (68.27th percentile)
• 2-sigma (95.45th percentile)
• 3-sigma (99.73rd percentile)
• weighted average (without zero utilization)
• determine MAX(mean, median)
Frame utilization:
• Monitor max utilization over time. • Monitor impact of deploying additional workload (partitions).
Some considerations when deciding on using a weighted average:
• Remove all zero samples
• Remove samples below a specific percentile (< PN%)
• Remove spikes (such as above Q3)
• Limit samples for the average to between Q1 and Q3
Simplified sizing model:
• MAX(mean,median) for EC • MAX(INT(max)) for VP
Establish a weight value model – for example 1-10-100-255 or 1-25-255. See the audit sketch below.
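To check how the current weights on a server line up with such a model, a read-only HMC query like the following can be used; the managed system name is a placeholder and the field names are assumed to be available on this HMC level.

  # List uncapped weight, entitlement and VPs for every partition, sorted by weight (descending)
  lshwres -r proc -m MANAGEDSYSTEM --level lpar \
    -F lpar_name,curr_sharing_mode,curr_uncap_weight,curr_proc_units,curr_procs | sort -t, -k3 -n -r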
CPU Utilization Target Thresholds (1/2)
§ Server target utilization level
– Target at most a 90% maximum utilization level for 32-64 core servers; establish workload specific target levels to ensure that merging active/active cluster node capacity within one server will not impact reliability or availability through increased dispatch latency or reduced partition specific throughput.
Smoothed bezier curve, peak of the week
Business week workload dropped over weekend
Max is a spike outlier or anomaly
NOTE: This is the server physical processor pool utilization reported to HMC from the server.
Considerations for CPU Utilization Target Thresholds (2/2)
IBM Workload Estimator Performance Team recommended thresholds: We do not recommend setting the utilization values above the IBM defaults. This can lead to the selection of a system that may experience performance problems when "spikes" in the workloads occur.
https://www-912.ibm.com/wle/EstimatorServlet?view=help
AIX partition view
§ Shared
– Shared processor partitions running below EC ordinarily have a higher level of idle/wait
– Shared processor partitions running above EC ordinarily have a lower level of idle/wait
§ Dedicated
– Dedicated processor partitions donate (cede) unused capacity with the share_idle_procs_always partition profile attribute set and, when below 80% utilization (ded_cpu_donate_thresh), also with the keep_idle_procs attribute set
– NOTE: If shared uncapped with EC equal to VP, then either change to a dedicated processor partition or increase VP above EC to take advantage of free uncapped processing capacity
An lparstat sketch follows below.
Ceding unused capacity to hypervisor (lparstat)
Idle
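A quick way to observe the shared/dedicated behaviour described above from inside AIX is lparstat; the interval and count below are arbitrary.

  # Partition type, mode (Capped/Uncapped/Donating), entitlement and online VPs
  lparstat -i | grep -E "Type|Mode|Entitled Capacity|Online Virtual CPUs"
  # 10 samples, 6 seconds apart: compare physc vs entc, watch %idle and app (free pool processors)
  lparstat 6 10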
Single AIX shared uncap partition locked in the box
Sampling period:12/03/2013 - 12/10/2013 Sample points (n): 10774 Max: 15.96 Min: 0.09 Mean: 9.40 Standard deviation: 5.50 Variance: 2.34 Median (50%): 10.59 1st Quartile (25%): 3.99 3rd Quartile (75%): 14.55 Average (90%): 15.95 Peak (99.73%): 15.96
Shared uncap VP=16 EC=8 Weight=128 EC:VP ratio 1:2 Average ~ Peak
One hump and tail Dedicated 24-core
sys% usr%
Double hump Dedicated 256-core
sys%
usr%
Multi constrained EC=9 VP=9 SMT4
sys%
usr%
pgsp<700
rq<90
Unbalanced dual Virtual I/O Servers (VIOS#1 vs VIOS#2)
§ Sample with symmetric config
– Shared uncapped
– VP=6
– EC=2.0
– Weight=255
Unbalanced dual node cluster (NODE#1 vs NODE#2)
§ Sample with asymmetric config
– Shared uncapped
– VP=9 / 1
– EC=6.0 / 0.1
– Weight=196 / 32
§ Note:
– Spikes are de-duplication during backup
Unbalanced quad node cluster 1 & 2 fairly balanced 3 is standby by design 4 used by appdev team for testing
Processor Pool Limited by Shared Processor Pool max
Need more CPU
SPP max = 13.0 With only two (2) partitions
procpool:05/31/2014 - 06/01/2014 Sample points (n): 2422 Max: 13 Min: 10.58 Mean: 12.44 Standard deviation: 0.40 Median (50%): 12.54 1st Quartile (25%): 12.28 3rd Quartile (75%): 12.73
Shared Processor Pools (SPP) for limiting core-based licenses
§ A shared partition's entitlement is a guaranteed amount of processing units
§ Shared uncapped partitions are capped at their VP processing capacity
– Example: 1 VP = 1.0, or 100 timeslices/s; 16 VPs = 16.0, or 1600 timeslices/s
§ Shared Processor Pools
– The shared processor pool capacity is logical and is a cap for all partitions belonging to the pool – it is not physical and is not contained to a specific set of cores
– All logical partitions compete for the same unused physical processor capacity in the server – even though they belong to different shared processor pools
– The hypervisor distributes unused capacity among all of the uncapped shared processor partitions that are configured on the server – regardless of the shared processor pools to which they are assigned
– If no contention exists for processor resources, the virtual processors are immediately distributed across the logical partitions independent of the partitions' uncapped weights
– Uncapped weight is ONLY considered when there is contention on physical cores for CPU
– A partition's weight value is not a priority, nor is it a share of the processor pool for the partition
• NOTE: Look for possible updates on this topic. A pool query sketch follows below.
http://pic.dhe.ibm.com/infocenter/powersys/v3r1m5/index.jsp?topic=/iphat/iphatsharedproc.htm
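A read-only HMC sketch for checking the pool caps described above; the managed system name is a placeholder, and the exact field names may vary by HMC level.

  # List shared processor pools with their maximum (licensed) processing units
  lshwres -r procpool -m MANAGEDSYSTEM -F name,shared_proc_pool_id,max_pool_proc_units,curr_reserved_pool_proc_units
  # Show which partitions run in which pool, and with what entitlement/VPs
  lshwres -r proc -m MANAGEDSYSTEM --level lpar -F lpar_name,curr_shared_proc_pool_id,curr_proc_units,curr_procs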
High weight value shared uncapped partitions limited by SPP max
| Name | EC | VP | SPP | Pool Max |
| db1 | 4 | 10 | CORELIC1 | 13 |
| db2 | 6 | 8 | CORELIC1 | 13 |
Contention: the partitions win against other partitions, but the SPP max caps them.
Virtual I/O Server dependencies
§ Review, assess, evaluate:
– Processor mode
– Processor entitlement
– Partition placement
– Adapter placement
– Virtual Ethernet pre-allocated buffers
– VIOS administrative network interface placement
– Etherchannel transmit and receive balance
– Ethernet adapter port and switch port flow control
– Absence of error conditions
– Uptime
https://www-912.ibm.com/wle/EstimatorServlet?view=help&topic=virtualIOHelp
VIOS Performance Advisor (1/2)
The Advisor is a standalone application that polls key performance metrics for minutes or hours, then analyzes the results to produce a report that summarizes the health of the environment and proposes potential actions that can be taken to address performance inhibitors.
STEP 1) Download the VIOS Advisor (shipped with PowerVM VIOS 2.2.2) – only a single executable is required to run within the VIOS
STEP 2) Run the executable – the VIOS Advisor can monitor from 5 minutes up to 24 hours; from VIOS 2.2.2 the built-in part command monitors 10-60 minutes
STEP 3) View the XML file – open the .xml file in a web browser to get an easy to interpret report summarizing your VIOS status
From PowerVM VIOS 2.2.2: http://www-01.ibm.com/support/knowledgecenter/POWER7/p7hcg/part.htm
Before PowerVM VIOS 2.2.2: https://www.ibm.com/developerworks/wikis/display/WikiPtype/VIOS+Advisor
VIOS Performance Advisor (2/2)
§ PowerVM VIOS 2.2.2
§ Default sampling period up to 60 min
§ Run the part command on topas_nmon sampled data for a longer period:
1. topas_nmon -X -s 5 -c 2880 -t -w 4 -l 150 -I 0.1 -ytype=advisor -m /tmp -youtput_dir=/tmp/perf
2. part -f /tmp/perf/vios1.nmon
CPU Monitoring Receive early detection of potential performance inhibitors. CPU Section Snapshot
WARNING: Best practice is for VIOS to have an increased priority when in uncapped shared processor mode.
CRITICAL: Having more virtual processors assigned to a partition than what is available in the pool can lead to negative performance impacts.
LEGEND: Informative | Warning | Investigate | Critical | Optimal
Memory Monitoring – receive early detection of potential performance inhibitors. Memory Section Snapshot.
INVESTIGATE: The Advisor detects that the VIOS has an excessive amount of memory and recommends investigating the option of reducing the memory allocation.
Click on any topic to get more details, including recommended actions.
Memory Monitoring – receive early detection of potential performance inhibitors. Memory Section Snapshot.
INVESTIGATE: The Advisor detects that VIOS free memory is too low to accommodate additional utilization and recommends investigating the option of increasing the memory allocation.
Disk Monitoring – receive early detection of potential performance inhibitors. Disk Section Snapshot.
INVESTIGATE: The Advisor detects high blocked I/O and long I/O latency on the VIOS.
Partition Memory Affinity & Utilization
POWER7 – POWER8
§ Affinity is a measurement of the proximity a thread has to a physical resource; performance is optimal when data crossing affinity domains is minimized.
§ POWER7 can span an entire system with up to 3 hops
– Two interconnects between separate planes
– Workloads can benefit from ASO, with or without DSO, for enhanced dynamic affinity
§ POWER8 can span an entire system with up to 2 hops
– Multiple interconnects between separate planes
– Higher memory and I/O bandwidth
– ASO is not supported and not expected to be needed
E870/E880 Memory Plug Rules & Best Practices
(Diagram: default configuration with four CDIMMs per POWER8 socket vs. best memory bandwidth with all eight CDIMM slots populated)
Performance considerations for minimum support:
• The system will be configured with a minimum of four CDIMMs (default) per socket
• Best memory bandwidth requires fully populated (eight) CDIMM slots
• Minimum CDIMM size is 16 GB
• 128 GB CDIMM bandwidth is limited to 120 GB/second (eight CDIMMs)
Note: Support needs to verify that there are at least four CDIMMs spread across the two chips.
POWER7 Affinity (1/2) – more sensitive than POWER8
§ Each POWER7 chip has memory controllers that allow direct access to a portion of the memory DIMMs in the system.
– Any processor core on any chip in the system can access the memory of the entire system, but it takes longer for an application thread to access the memory attached to a remote chip than to access data in the local memory DIMMs.
§ Affinity is a measurement of the proximity a thread has to a physical resource, and performance is optimal when data crossing affinity domains is minimized
– Resources such as L1/L2/L3/L3.5 cache, memory, core, chip and book/node
– Cache affinity: threads in different domains need to communicate with each other, or cache needs to move with thread(s) migrating across domains
– Memory affinity: threads need to access data held in a different memory bank not associated with the same chip or node
– Enhanced affinity: the OS and hypervisor maintain metrics on a thread's affinity and dynamically attempt to maintain the best affinity to those resources (from AIX 6.1 TL05 and POWER7).
• Explore Active System Optimizer (ASO): http://pic.dhe.ibm.com/infocenter/aix/v7r1/topic/com.ibm.aix.optimize/optimize_kickoff.htm
• Exclusive use processor resource sets (XRSETs): http://pic.dhe.ibm.com/infocenter/aix/v7r1/topic/com.ibm.aix.baseadmn/doc/baseadmndita/excluseprocrecset.htm
• Use an environment variable to specify process memory placement (MEMORY_AFFINITY): http://pic.dhe.ibm.com/infocenter/aix/v7r1/topic/com.ibm.aix.prftungd/doc/prftungd/aix_mem_aff_support.htm
• Memory allocators (MALLOC): http://pic.dhe.ibm.com/infocenter/aix/v7r1/topic/com.ibm.aix.genprogc/doc/genprogc/sys_mem_alloc.htm
A placement sketch follows below.
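As an illustration of the placement controls listed above, the following hedged sketch pins a workload to a processor rset and requests local memory backing; the CPU range and program path are placeholders, and the actual benefit depends on the partition's SRAD layout and privileges.

  # Request that private memory be allocated from the MCM/chip where the thread runs
  export MEMORY_AFFINITY=MCM
  # Run a workload attached to logical CPUs 0-15 only (placeholder CPU range and command)
  execrset -c 0-15 -e /path/to/your_app
  # Verify which books/chips (REF1/SRAD) the partition spans
  lssrad -av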
POWER7 Affinity (2/2) – more sensitive than POWER8
POWER7/POWER7+: 795 (up to 4 chips/book), 780/770 (up to 2 chips/book)
Relative to the running hardware thread:
• Local affinity (S3hrd): memory attached to the chip where the thread executes
• Near affinity (S4hrd): memory on another chip in the same book/CEC/plane
• Far affinity (S5hrd): memory in a different book/CEC/plane
(Schematic: two books/CECs/planes, each with four chips and their locally attached DIMMs, illustrating a Power 795 book.)
Note: Simplified example, depending on architecture and core/chip proximity to node interlinks.
Example with 9179-MHD Memory § Example with – 3.72 GHz POWER7+ SCM with eight cores each, in each CEC – Each CEC can be populated by up to 1 TB of 1066 MHz DDR3 DIMMs
Reference: http://pic.dhe.ibm.com/infocenter/powersys/v3r1m5/topic/p7ecs/p7ecs.pdf
• Two (2) CECs have 384 GB
• One (1) CEC has 256 GB
|-----------|-----------------------|---------------|~~~| | Domain | Procs Units | Memory |~~~| | SEC | PRI | Total | Free | Free | Total | Free |~~~| |-----|-----|-------|-------|-------|-------|-------|~~~| | 0 | | 3200 | 1100 | 0 | 1536 | 496 |~~~| | | 0 | 800 | 100 | 0 | 512 | 69 |~~~| | | 1 | 800 | 400 | 0 | 256 | 73 |~~~| | | 2 | 800 | 0 | 0 | 512 | 184 |~~~| | | 3 | 800 | 600 | 0 | 256 | 170 |~~~| |-----|-----|-------|-------|-------|-------|-------|~~~| | 1 | | 3200 | 1300 | 50 | 1536 | 744 |~~~| | | 4 | 800 | 0 | 0 | 512 | 62 |~~~| | | 5 | 800 | 500 | 0 | 256 | 256 |~~~| | | 6 | 800 | 200 | 0 | 512 | 170 |~~~| | | 7 | 800 | 600 | 50 | 256 | 256 |~~~| |-----|-----|-------|-------|-------|-------|-------|~~~| | 2 | | 3200 | 2900 | 0 | 1024 | 430 |~~~| | | 8 | 800 | 500 | 0 | 512 | 0 |~~~| | | 9 | 800 | 800 | 0 | 0 | 0 |~~~| | | 10 | 800 | 800 | 0 | 512 | 430 |~~~| | | 11 | 800 | 800 | 0 | 0 | 0 |~~~| |-----|-----|-------|-------|-------|-------|-------|~~~|
Note: Sample resource dump edited for clarity, some columns replaced with "~~~“. startdump -m <managed-system> -t "resource" -r "hvlpconfigdata -affinity -domain -memory -procs -bus"
P3-C1 P3-C2 P3-C3 P3-C4 P3-C5 P3-C6 P3-C7 P3-C8 P3-C9 P3-C10 P3-C11 P3-C12 P3-C13 P3-C14 P3-C15 P3-C16 P3-C17 P3-C18 P3-C19 P3-C20 P3-C21
CEC DBJI987 U2C4E.001.DBJI987-P3-C1 U2C4E.001.DBJI987-P3-C2 U2C4E.001.DBJI987-P3-C3 U2C4E.001.DBJI987-P3-C4
U2C4E.001.DBJI987-P3-C7 U2C4E.001.DBJI987-P3-C8 U2C4E.001.DBJI987-P3-C9 U2C4E.001.DBJI987-P3-C10 U2C4E.001.DBJI987-P3-C11 U2C4E.001.DBJI987-P3-C12
CEC DBJJ086 U2C4E.001.DBJJ086-P3-C1 U2C4E.001.DBJJ086-P3-C2 U2C4E.001.DBJJ086-P3-C3 U2C4E.001.DBJJ086-P3-C4 Processor card regulator 5 Processor card regulator 6 U2C4E.001.DBJJ086-P3-C7 U2C4E.001.DBJJ086-P3-C8 U2C4E.001.DBJJ086-P3-C9 U2C4E.001.DBJJ086-P3-C10 U2C4E.001.DBJJ086-P3-C11 U2C4E.001.DBJJ086-P3-C12
U2C4E.001.DBJI987-P3-C18 U2C4E.001.DBJI987-P3-C19
Processor card regulator 7 TPMD card Processor card regulator 8 U2C4E.001.DBJJ086-P3-C18 U2C4E.001.DBJJ086-P3-C19
CEC DBJI983 U2C4E.001.DBJI983-P3-C1 U2C4E.001.DBJI983-P3-C2 U2C4E.001.DBJI983-P3-C3 U2C4E.001.DBJI983-P3-C4
U2C4E.001.DBJI983-P3-C7 U2C4E.001.DBJI983-P3-C8 U2C4E.001.DBJI983-P3-C9 U2C4E.001.DBJI983-P3-C10
Note: Output from lscfg -pv
Partition placement and memory affinity (example with Power 795)
Partition: GB=240, EC=20.0, VP=30
lssrad -av (reconstructed layout; REF1 is a separate book/CEC/plane, SRAD is a separate chip – Scheduler Resource Allocation Domain):
REF1 0: SRAD 0 MEM 47766.56 CPU 0-3 12-15 28-31 44-47 60-63 76-79; SRAD 1 MEM 47792.00 CPU 4-7 16-19 32-35 48-51 64-67 80-83; SRAD 2 MEM 47790.56 CPU 8-11 20-23 36-39 52-55 68-71 84-87; SRAD 3 MEM 35856.00 CPU 24-27 40-43 56-59 72-75
REF1 1: SRAD 4 MEM 7719.00 CPU 88-91; SRAD 5 MEM 7719.00 CPU 92-95; SRAD 6 MEM 7719.00 CPU 96-99; SRAD 7 MEM 7470.00 CPU 100-103
REF1 2: SRAD 8 MEM 7221.00 CPU 104-107; SRAD 9 MEM 7221.00 CPU 108-111; SRAD 10 MEM 6972.00 CPU 112-115; SRAD 11 MEM 6972.00 CPU 116-119
§ mpstat -d command
– Memory access affinity:
• S3hrd – % of local dispatches from the same chip
• S4hrd – % of near dispatches from the same plane
• S5hrd – % of remote dispatches from a different plane
– Thread redispatch affinity:
• S0rd – % of thread redispatches to the same core thread
• S1rd – % of thread redispatches to the same core
• S2rd – % of thread redispatches to the same chip set
• S3rd – % of thread redispatches to the same MCM
• S4rd – % of thread redispatches to the same book/CEC/plane
• S5rd – % of thread redispatches to a different book/CEC/plane
mpstat -d sample:
cpu S0rd S1rd S2rd S3rd S4rd S5rd S3hrd S4hrd S5hrd
ALL 84.7 1.7 0.0 7.3 0.0 6.3 71.8 7.4 20.8
Notes:
• If a partition is spread over multiple REF1s and/or SRADs and mpstat shows S3hrd at 100%, the partition placement does not impact application thread memory access affinity during the sample period.
• Home node assignment requires core:memory; home node dispatch is preferred; inter book/CEC/plane access should be <5%.
RSCDUMP hvlpconfigdata -affinity -domain (excerpt for this partition): LPAR 7, placement S/W (spread wherever it fits), Score=67, Wgt=22.12, Cust=1, with processing units and LMBs spread across domains 0-7.
Partition placement and memory affinity example – memory access affinity
• S3hrd – % of local dispatches from the same chip
• S4hrd – % of near dispatches from the same plane/book
• S5hrd – % of remote dispatches from a different plane/book
System A: lssrad -av – the partition is spread over three books (REF1 0: SRAD 0, 6, 9, 11; REF1 1: SRAD 1, 2, 7, 10; REF1 2: SRAD 3, 4, 5, 8), with per-SRAD memory of 58483.00, 43805.31, 37101.00 and 29382.00 MB (REF1 0), 48555.00, 48306.00, 37350.00 and 35856.00 MB (REF1 1), and 44322.00, 44322.00, 44073.00 and 37350.00 MB (REF1 2); logical CPUs 0-127 are spread across all twelve SRADs.
System A: mpstat -d
S0rd S1rd S2rd S3rd S4rd S5rd S3hrd S4hrd S5hrd
86.4 1.6 0.0 4.4 0.0 7.5 65.9 10.7 23.4
System B: lssrad -av – the partition is contained in one book (REF1 0: SRAD 0-3) with per-SRAD memory of 126444.00, 110039.31, 110058.00 and 99102.00 MB; logical CPUs 0-127 are spread across the four SRADs.
System B: mpstat -d
S0rd S1rd S2rd S3rd S4rd S5rd S3hrd S4hrd S5hrd
87.7 0.9 0.0 9.4 0.0 2.0 92.3 7.7 0.0
More memory needed than available per chip
lssrad -av legend: REF1 = book/drawer, SRAD = chip, MEM = memory (MB), CPU = SMT thread
lssrad -av (reconstructed): SRAD 0 MEM 52519.25 CPU 0-31; SRAD 1 MEM 17181.00 CPU 32-39; SRAD 2 MEM 30378.00 CPU 40-55; SRAD 3 MEM 249.00 CPU 56-75; SRAD 4 MEM 20916.00 CPU 76-87; SRAD 5 MEM 0.00 CPU 88-111; SRAD 6 MEM 0.00 CPU 112-127; SRAD 7 MEM 49800.00 (no CPU); SRAD 8 MEM 16434.00 (no CPU); SRAD 9 MEM 33864.00 (no CPU); SRAD 10 MEM 83118.00 (no CPU); SRAD 11 MEM 13695.00 (no CPU) – spread over different books (REF1 0-3), with MB of memory per logical CPU (SMT thread) severely UNBALANCED and several SRADs holding memory but no CPUs.
curt report: average thread affinity ranges from 0.00 up to 0.99 across threads, with a group of threads as low as 0.00-0.29.
Fragmentation cause: too much memory for an LPAR, did not boot the largest LPAR first, DLPAR operations. Solution: defragment.
trace -C all -L 20000000 -T 20000000 -J curt -andfo trace.raw; trcon; sleep 30; trcstop; curt -t -e -i trace.raw -o curt.out
Chips do not have memory
lssrad -av legend: REF1 = book/drawer, SRAD = chip, MEM = memory (MB), CPU = SMT thread
==== START lssrad -va Fri Oct 14 12:13:27 MEST 2011 ====
lssrad -av (reconstructed): SRAD 0 MEM 98311.25 CPU 0-13; SRAD 1 MEM 0.00 CPU 14-27; SRAD 2 MEM 98114.00 CPU 28-41; SRAD 3 MEM 0.00 CPU 42-55; SRAD 4 MEM 98283.00 CPU 56-69; SRAD 5 MEM 0.00 CPU 70-83; SRAD 6 MEM 0.00 CPU 84-95; SRAD 7 MEM 41686.00 – spread over REF1 0-3.
Fragmentation cause: chips do not have memory (no DIMMs). Solution: balance the memory (add DIMMs).
All memory allocated to one partition
lssrad -av legend: REF1 = book/drawer, SRAD = chip, MEM = memory (MB), CPU = SMT thread
==== START lssrad -va Fri Nov 25 13:22:08 MSK 2011 ====
lssrad -av (reconstructed): SRAD 0 MEM 11834.00 CPU 0-7; SRAD 1 MEM 12590.00 CPU 8-15; SRAD 2 MEM 0.00 CPU 16-23; SRAD 3 MEM 12647.00 CPU 24-31; SRAD 4 MEM 12590.00 CPU 32-39; SRAD 5 MEM 12647.00 CPU 40-47; SRAD 6 MEM 12647.00 CPU 48-55; SRAD 7 MEM 12631.00 CPU 56-63; SRAD 8 MEM 12590.00 CPU 64-71; SRAD 9 MEM 12631.00 CPU 72-79; SRAD 10 MEM 12631.00 CPU 80-87; SRAD 11 MEM 12631.00 CPU 88-95; SRAD 12 MEM 12590.00 CPU 96-103; SRAD 13 MEM 12631.00 CPU 104-111; SRAD 14 MEM 12688.00 CPU 112-119; SRAD 15 MEM 12704.00 CPU 120-127 – across REF1 0-3.
Fragmentation cause: all memory allocated to one partition… (* hypervisor memory for partition HPT, VPT etc.). Solution: working as designed – allocated from the same affinity domain.
Did not boot largest LPAR first
lssrad -av legend: REF1 = book/drawer, SRAD = chip, MEM = memory (MB), CPU = SMT thread
==== START lssrad -va Sat Nov 12 19:52:46 MSK 2011 ====
lssrad -av (reconstructed): SRAD 0 MEM 40763.00 CPU 0-1; SRAD 29 MEM 37856.00; SRAD 30 MEM 38246.00; SRAD 31 MEM 39589.00; SRAD 1 MEM 49054.00 CPU 2-17; SRAD 2 MEM 0.00 CPU 18-33; SRAD 3 MEM 46738.00 CPU 34-49; SRAD 4 MEM 49255.00 CPU 50-65; SRAD 5 MEM 49054.00 CPU 66-81; SRAD 6 MEM 49255.00 CPU 82-97; SRAD 7 MEM 49255.00 CPU 98-113; SRAD 8 MEM 49255.00 CPU 114-129 – spread over REF1 0-2, with memory in far SRADs (29-31) that hold no CPUs for this partition.
Fragmentation cause: did not boot the largest LPAR first; too much memory. Solution: defragment.
Power 770 partition resource dump affinity table
LPAR ID #43 Spread over three (3) CEC
|-----------|-----------------------|---------------|------|---------------|---------------|-------| | Domain | Procs Units | Memory | | Proc Units | Memory | Ratio | | SEC | PRI | Total | Free | Free | Total | Free | LP | Tgt | Aloc | Tgt | Aloc | | |-----|-----|-------|-------|-------|-------|-------|------|-------|-------|-------|-------|-------| | 0 | | 1600 | 0 | 60 | 2048 | 898 | | | | | | 0 | | | 0 | 800 | 0 | 30 | 1024 | 418 | | | | | | 0 | | | | | | | | | 43 | 460 | 460 | 454 | 454 | | | | | | | | | | 57 | 110 | 110 | 48 | 48 | | | | | | | | | | 103 | 100 | 100 | 32 | 32 | | | | | | | | | | 105 | 100 | 100 | 32 | 32 | | | | 1 | 800 | 0 | 30 | 1024 | 480 | | | | | | 0 | | | | | | | | | 10 | | | 63 | 63 | | | | | | | | | | 34 | 30 | 30 | | | | | | | | | | | | 59 | 660 | 660 | 184 | 184 | | | | | | | | | | 87 | 30 | 30 | 168 | 168 | | | | | | | | | | 98 | 10 | 10 | 32 | 32 | | | | | | | | | | 100 | 20 | 20 | 24 | 24 | | | | | | | | | | 121 | 20 | 20 | 48 | 48 | | |-----|-----|-------|-------|-------|-------|-------|------|-------|-------|-------|-------|-------| | 1 | | 1600 | 0 | 40 | 2048 | 758 | | | | | | 0 | | | 4 | 800 | 0 | 10 | 1024 | 351 | | | | | | 0 | | | | | | | | | 10 | | | 1 | 1 | | | | | | | | | | 14 | 60 | 60 | 260 | 260 | | | | | | | | | | 15 | 10 | 10 | 104 | 104 | | | | | | | | | | 27 | 10 | 10 | 32 | 32 | | | | | | | | | | 32 | 30 | 30 | 16 | 16 | | | | | | | | | | 36 | 200 | 200 | 32 | 32 | | | | | | | | | | 46 | 20 | 20 | | | | | | | | | | | | 48 | 160 | 160 | 128 | 128 | | | | | | | | | | 71 | 310 | 310 | 48 | 48 | | | | 5 | 800 | 0 | 30 | 1024 | 407 | | | | | | 0 | | | | | | | | | 41 | 180 | 180 | 120 | 120 | | | | | | | | | | 43 | 570 | 570 | 454 | 454 | | | | | | | | | | 122 | 20 | 20 | 40 | 40 | | |-----|-----|-------|-------|-------|-------|-------|------|-------|-------|-------|-------|-------| | 2 | | 1600 | 0 | 0 | 2048 | 205 | | | | | | 0 | | | 8 | 800 | 0 | 0 | 1024 | 73 | | | | | | 0 | | | | | | | | | 1 | 200 | 200 | 32 | 32 | | | | | | | | | | 2 | 200 | 200 | 32 | 32 | | | | | | | | | | 23 | 10 | 10 | 32 | 32 | | | | | | | | | | 25 | 20 | 20 | 40 | 40 | | | | | | | | | | 34 | 60 | 60 | 93 | 93 | | | | | | | | | | 43 | 50 | 50 | 376 | 376 | | | | | | | | | | 69 | 80 | 80 | 20 | 20 | | | | | | | | | | 74 | 10 | 10 | 32 | 32 | | | | | | | | | | 93 | 20 | 20 | 64 | 64 | | | | | | | | | | 119 | 20 | 20 | 40 | 40 | | | | | | | | | | 124 | 130 | 130 | 112 | 112 | | | | 9 | 800 | 0 | 0 | 1024 | 132 | | | | | | 0 | | | | | | | | | 15 | 30 | 30 | | | | | | | | | | | | 21 | 60 | 60 | 64 | 64 | | | | | | | | | | 29 | 20 | 20 | 32 | 32 | | | | | | | | | | 32 | 20 | 20 | 16 | 16 | | | | | | | | | | 34 | 40 | 40 | 63 | 63 | | | | | | | | | | 43 | 10 | 10 | 76 | 76 | | | | | | | | | | 46 | 140 | 140 | 272 | 272 | | | | | | | | | | 87 | 30 | 30 | 168 | 168 | | | | | | | | | | 95 | 10 | 10 | 32 | 32 | | | | | | | | | | 110 | 100 | 100 | 40 | 40 | | | | | | | | | | 112 | 400 | 400 | 120 | 120 | | |-----|-----|-------|-------|-------|-------|-------|------|-------|-------|-------|-------|-------|
Hypervisor view of partition placement (1/2)
No memory
LPAR ID #4 is spread over more than one book which ordinarily is not optimal for partition performance
No memory
No memory
Hypervisor view of partition placement (2/2) Partition placement of previously scattered partition, after replacement action
Partition placement after ad-hoc increase of processors
Partition placement after additional replacement action
HMC > Resource Dump > hvlpconfigdata -affinity -domain (1/2)
Internal tool – may change at any time without notification or warning
SEC – secondary domain (book, drawer, node, ...)
PRI – primary domain (chip)
Procs – 100 units is one core (Total/Free per SEC and PRI)
Units Free – number of units that are free in the domain for procs/memory
Memory – max memory in LMBs (Total/Free per SEC and PRI)
LP – logical partition id
Ratio – ratio between free procs and memory in the domain
Procs – 100 units is one core (Total/Free initially allocated to the partition)
Memory – max memory in LMBs (Total/Free initially allocated to the partition)
On the HMC:
1. Servers > select the managed server > Hardware > Manage Dumps > popup window
2. Action > Initiate Resource Dump > popup window
3. Manage Dumps: input resource selector hvlpconfigdata -affinity -domain, then OK
4. Refresh
5. Select the radio button for the resource dump
6. Selected > Copy Dump to Remote System: input the remote FTP server IP address, user ID and password, and the directory to store the resource dump file (~10 KB), or access the RSDUMP.* file from the HMC /dump directory.
A CLI sketch follows below.
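The same resource dump can be started from the HMC command line with startdump, as shown earlier for the 9179-MHD example; the managed system name is a placeholder, and the resulting RSDUMP.* file ends up under /dump on the HMC as noted above.

  # Initiate the affinity/domain resource dump from the HMC restricted shell
  startdump -m MANAGEDSYSTEM -t "resource" -r "hvlpconfigdata -affinity -domain"
  # Locate the generated dump file(s) before copying them off the HMC
  ls -l /dump/RSDUMP*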
HMC > Resource Dump > hvlpconfigdata -affinity -domain (2/2)
Internal tool – may change at any time without notification or warning
Ordr – order of placement from server reboot (largest partitions placed first, down to the smallest); 65535 is a special value indicating that post IPL the partition was created, or something was changed in the partition that caused the hypervisor to try to re-place it.
Sprd/Sec field values – the placement that was done for the partition:
• C/P – contain memory/procs in the primary (chip) domain
• C/S – contain in the secondary (book/drawer) domain
• S/S – spread across multiple secondary domains
• S/W – spread wherever the partition can fit
Pref – preferred domains if resources need to be added.
Hint to the hypervisor for a specific partition with the lpar_placement profile attribute:
• 0 is SCATTER (default)
• 1 is PACK the partition into the minimum number of PU books/CECs; indicates that the hypervisor should try to minimize the number of domains assigned to the partition
• In the 730 level of firmware, lpar_placement=1 was only recognized for dedicated processor partitions when SPPL=MAX
• Starting with the 760 firmware level, lpar_placement=1 is also recognized for shared processor partitions with SPPL=MAX and systems configured to run in TurboCore mode
• 2 is PACK the partition memory and processors into the minimum number of domains when the partition memory size cannot be contained within one PU book/CEC; available from the 760 firmware level
A hint can also be given to the hypervisor by placing partitions in separate affinity groups.
Defragmentation of partitions logical memory core affinity with Dynamic Platform Optimizer § To run the optimizer there must be unlicensed memory installed or available licensed memory – AIX supporting DPO: 6100-07-04, 6100-08-03, 7100-00-04, 7100-01-04, 7100-02-03, 7100-03-00
§ Check affinity scores for the managed-system – The score is a number between 0 and 100, with 0 representing the worst affinity and 100 representing perfect affinity – lsmemopt -m <managed-system> -o currscore
§ Perform calculation for a managed system – List the potential affinity score which could be attained after running a Dynamic Platform Optimization operation – lsmemopt -m <managed-system> -o calcscore
§ Start Dynamic Platform Optimization (DPO) operation – System performance will degrade during a Dynamic Platform Optimization operation, and may take a long time to complete. – When partition is shutdown, the replacement takes seconds, not minutes – For all partitions: • optmem -m <managed-system> -o start -t affinity – For specified partitions (include in the calculation and replacement): • optmem -m <managed-system> -o start -t affinity –p 'lpar1,lpar2,lpar3' – For all but specified partitions (exclude from the calculation and replacement, all other partitions not excluded will be included): • optmem -m <managed-system> -o start -t affinity –x 'lpar4,lpar5,lpar6'
§ Stop Dynamic Platform Optimization (DPO) operation – Dynamic Platform Optimization operations should not be stopped. Stopping a Dynamic Platform Optimization operation before it has completed could leave the system in an affinity state that is much worse than before the operation started. – optmem -m <managed-system> -o stop
§ Check progress of a Dynamic Platform Optimization (DPO) operation
– lsmemopt -m <managed-system>
Related: set the placement hint in a partition profile, for example: chsyscfg -r prof -m MS -i "name=PROF,lpar_name=LPAR,lpar_placement=1"
http://pic.dhe.ibm.com/infocenter/powersys/v3r1m5/topic/p7edm/lsmemopt.html
http://pic.dhe.ibm.com/infocenter/powersys/v3r1m5/topic/p7edm/optmem.html
Optimization priority order:
1. Partitions with the lpar_placement attribute set
2. Partitions belonging to a user-defined affinity group (255-1)
3. Size of partitions based on CPU/memory resources (more = higher priority)
A score-check sketch follows below.
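Putting the DPO commands above together, a hedged operational sketch: check the current and calculated scores first, and only start an optimization in a change window when the potential gain justifies the temporary performance degradation; the managed system name is a placeholder.

  # Current and potential affinity scores (0 = worst, 100 = perfect)
  lsmemopt -m MANAGEDSYSTEM -o currscore
  lsmemopt -m MANAGEDSYSTEM -o calcscore
  # If the calculated score is meaningfully higher than the current score, run DPO
  # in a maintenance window (performance degrades while it runs):
  optmem -m MANAGEDSYSTEM -o start -t affinity
  # Monitor progress until the operation reports complete
  lsmemopt -m MANAGEDSYSTEM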
DPO lsmemopt with calcscore on HMC for Power 780 with 780 system firmware
Implementation Details – Work Flow (reference: Steve Nasypany)
The same high-level flow is used for optimization and score prediction; for prediction no LMBs or CPUs are actually moved, and the predicted score is based on the predicted virtual memory/CPU layout.
1. HMC: CLI optmem request
2. PHYP: determine LPAR priority
3. Compute a preliminary optimization plan
4. Optimize HPTs in LPAR priority order – Hardware Page Table (HPT) objects require contiguous LMBs; some HPT objects may not be moved to the desired location due to fragmentation caused by guarded memory, TCE tables, etc.
5. Recompute the optimization plan based on the new memory layout
6. Reassign CPUs to LPARs
7. Optimize partition memory in LPAR priority order – one LMB at a time; atomic units (relocation granules) are 512 KB
8. Notify affected LPAR OSes – asynchronous notification to the HMC when complete; the HMC retrieves status/score with a separate command
Using Affinity Groups
§ Affinity groups can be used to place multiple LPARs (allocate resources) within a book, or to separate partitions onto separate books
– On a Power 795 with 32 cores in a book, the total physical core resources of an affinity group should not exceed 32 cores or the physical memory contained within a book
– Verify with an HMC resource dump: hvlpconfigdata -affinity -domain
§ The affinity_group_id represents the memory and processor affinity group in which the partition will participate
– Valid numbers are 1-255; assigning "none" removes the partition profile affinity group reference
– When the hypervisor places resources at frame reboot, it first places all the LPARs in group 255, then the LPARs in group 254, and so on
– Because of this, the most important partitions with respect to affinity should be placed in the highest configured group number
§ Assign an affinity group ID to a partition profile:
– Example group #1 (255):
• chsyscfg -m MANAGEDSYSTEM -r prof -i "lpar_name=LPARNAME, name=PROFILENAME, affinity_group_id=255"
– Example group #2 (254):
• chsyscfg -m MANAGEDSYSTEM -r prof -i "lpar_name=LPARNAME, name=PROFILENAME, affinity_group_id=254"
http://www-01.ibm.com/support/knowledgecenter/POWER7/p7edm/chsyscfg.html
Defragmentation of partition logical memory/core affinity without the Dynamic Platform Optimizer
§ Procedure to improve the placement through a server reboot:
a. Make any profile updates that you need to make to the partitions. Ensure lpar_placement is set for key critical partitions and assign affinity groups to prioritize the other important production partitions.
b. Power off all the defined partitions.
c. Activate (power on) all the partitions, specifying the profiles with the desired attributes; each partition needs to be activated at least to the SMS prompt.
d. Once all partitions are simultaneously activated with the updated profiles (i.e. every partition must be active), start powering off all of the partitions.
e. Power off the server.
f. Power on the server.
g. Activate the partitions from the management console. The order of activation is now not important.
§ Procedure to improve the placement without a server reboot:
a. Power off all the defined partitions.
b. Create a partition and indicate that the partition profile should own all of the resources in the system.
c. Activate this partition with the all-resources profile to the SMS menu.
d. Power off this partition.
e. Delete this partition.
f. Activate the partitions in order of importance. The largest, most important partition should be activated first, the next most important partition second, and so on. If dependent on VIOS, start the VIOS to the SMS menu, then the largest, most important partitions. Ensure lpar_placement is set for key critical partitions and assign affinity groups to prioritize the other important production partitions.
§ Procedure to change the placement for a single partition:
a. Change the partition's max memory, shut down and then activate the partition; the hypervisor releases all the allocated cores/memory and tries to re-place the partition.
https://www.ibm.com/developerworks/wikis/download/attachments/53871915/P7_virtualization_bestpractice.doc
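A hedged HMC CLI sketch of this single-partition re-placement (chsyscfg/chsysstate as in the HMC documentation; system, partition and profile names are placeholders):
chsyscfg -r prof -m <managed-system> -i "name=<profile>,lpar_name=<lpar>,max_mem=<new-max-MB>"   # change Max Memory in the profile
chsysstate -r lpar -m <managed-system> -o shutdown --immed -n <lpar>                             # shut the partition down
chsysstate -r lpar -m <managed-system> -o on -n <lpar> -f <profile>                              # activate again; the hypervisor re-places cores/memory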
Partition profile Max Memory
§ Memory is allocated in Logical Memory Blocks (LMBs).
– Each partition has a Hardware Page Table (HPT) allocated from its Online Memory.
– The hypervisor allocates the HPT so that it can fit Maximum Memory references to physical (real) memory.
– Memory used for the HPT is reserved by the hypervisor and cannot be used by partitions.
– Dynamic LPAR (DLPAR) operations to increase a partition's memory can be performed up to the specified Maximum Memory profile setting (if available).
[Figure: example tables mapping LPAR logical memory (LMBs) through the Hardware Page Table (HPT) to real memory addresses; the HPT is sized to the Maximum Memory setting while only the Desired LMBs are initially backed, and a DLPAR operation adds 7 LMBs up to the Maximum Memory setting.]
Partition profile Max Memory – using System Planning Tool (SPT)
§ Memory – before (Desired:Maximum ratio of 1:2)
– System memory (MB): 1048576
– Configured memory (MB): 430080
– Hypervisor memory (MB): 26680 (about 26GB)
– Unassigned memory (MB): 591616
The sum of all partitions' maximum memory settings is used to calculate the required hypervisor memory.
§ Memory – after (illustration example with Maximum Memory for all partitions set to a Desired:Maximum ratio of 1:1.5)
– System memory (MB): 1048576
– Configured memory (MB): 430080
– Hypervisor memory (MB): 20736 (about 20GB – reduced by 6GB)
– Unassigned memory (MB): 597760
AIX memory utilization
§ Memory is considered overcommitted if the number of pages currently in use exceeds the real memory pages available.
– Over 100% utilization is over-commitment of memory, which should be avoided since it usually results in paging of memory to backing storage (paging space paging, or pager release and file system page-in), and paging is the slowest form of memory access and always reduces performance.
§ The number of pages currently in use is the sum of the: – Virtual pages – File cache pages
§ If memory is over committed, then it is recommended to either: – Reduce the workload – Add more real memory
§ Can check with: – svmon command – vmstat command
Partition memory utilization
§ Business critical server systems should only use paging space as a safeguard if temporarily running out of real memory for production workload.
– Size application and system memory utilization to fit within the allocated partition memory.
– Memory is considered overcommitted if the number of pages currently in use exceeds the real memory pages available.
– Over 100% utilization is over-commitment of memory, which should be avoided since it usually results in paging of memory to backing storage (paging space paging, or pager release and file system page-in), and paging is the slowest form of memory access and always reduces performance.
IF memory utilization (((virtual+pers+clnt)/size)*100) > 100% THEN more than the available real memory is utilized.
Example (simplified): (((17529128+0+26441721)/41943040)*100) = 104.8%
svmon -G
                size      inuse       free        pin    virtual  mmode
memory      41943040   41643604     299265    9347913   17529128    Ded
pg space    41943040    2154825

                work       pers       clnt      other
pin          8053566          0        907    1293440
in use      15201883          0   26441721

PageSize   PoolSize      inuse       pgsp        pin    virtual
s    4 KB         -   33881332    2154825    1959241    9766856
m   64 KB         -     485142          0     461792     485142
SPOT CHECK (only showing partitions >98%)
Partition      %
lpar1      148.6
lpar2      131.8
lpar3      119.4
lpar4      118.8
lpar5      109.7
lpar6      104.5
lpar7      100.3
lpar8       99.9
lpar9       99.9
lpar10      98.4
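A minimal hedged sketch of the same spot check on one partition, assuming the svmon -G layout shown above (all values in 4 KB frames):
svmon -G | awk '
  $1 == "memory"             { size = $2; virtual = $6 }   # size and virtual from the memory row
  $1 == "in" && $2 == "use"  { pers = $4; clnt = $5 }      # pers and clnt from the "in use" row
  END { if (size > 0) printf "memory utilization: %.1f%%\n", (virtual + pers + clnt) / size * 100 }'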
Memory Pools and Least Recently Used page replacement § When the file pages drops below the minperm threshold (AIX 6.1/7.1 default is 3%), the system would be in a condition to page out to paging space. – The thresholds of minperm, maxperm, maxclient are on a per memory pool basis. – There are nCPU/cpu_scale_memp memory pools on each system, and each pool has 4k and 64k pages, with the number of real memory page sets per mempool (framesets default 2), divided into LRU buckets for scanning (lrubucket default 128K). – Each pool has its own LRU daemon thread which handles page replacement, and each pool has its own set of thresholds. – The page replacement algorithm (lrud) scans through the contents of a bucket and scans the same bucket for the second pass before going on to the next bucket
§ If one pool's computational memory is so high that the number of file pages in the pool drops below minperm, then that pool could page out to paging space (even if other pools have file pages that it could steal from) – APAR IV07461: REDUCE EARLY WORKING STORAGE PAGING
§ If one page-size pool (4k or 64k) is depleted (0), pages from the other pool can be converted
– The psmd daemon converts 4k pages to 64k pages and 64k pages to 4k pages
– For 64k page promotion, 16 contiguous 4k pages are required
– When the system is close to running out of memory, the psmd daemon can be very busy
– APAR IZ90456: 64K KERNEL HEAP CAUSES PAGING WHEN RAM IS OTHERWISE AVAILABLE
Terabyte Segment Aliasing (Large Segment Aliasing – LSA)
§ Enabled by default in AIX 7.1, can be enabled in AIX 6.1 from TL06
§ Can improve CPU performance by reducing TLB misses (segment address resolution)
§ Feature allows applications to use 1TB segments
– 1 SLB entry in POWER7 can now address 1TB of memory.
• Segment Lookaside Buffer (SLB) fault issue no longer relevant
• Immediate performance boost for applications, new and legacy
• If enabled, by default 1TB segments are used if the shared memory is > 3GB on POWER7 and > 11GB on POWER6 or earlier
• By default, LSA is disabled on AIX 6.1 TL06 and enabled in AIX 7.1
– Significant changes under the covers • New address space allocation policy • Attempts to group address space requests together to facilitate 1TB aliasing. • Once certain allocation size thresholds have been reached, OS automatically aliases memory with 1TB aliases. • 256MB segments still exist for handling IO
– Aliasing available for shared memory regions • Can be used by Oracle SGA
– Aliasing available for unshared memory regions
• With Oracle release 11.2.0.3, unshared segments are also used by Oracle (in some configurations this has caused performance degradation; consult IBM Support if it is enabled and a performance impact is detected)
• http://www-01.ibm.com/support/docview.wss?uid=isg1IV23851
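A hedged check/enable sketch, assuming the vmo tunables esid_allocator, shm_1tb_shared and shm_1tb_unshared are present at this AIX level (verify with vmo -a before changing anything):
vmo -a | egrep -i "esid_allocator|shm_1tb"    # display the current 1TB segment aliasing settings
vmo -p -o esid_allocator=1                    # enable LSA on AIX 6.1 TL06 or later (persistent across reboot)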
AIX process memory tuning
§ Explore Active System Optimizer (ASO)
– ASO dynamically optimizes workloads in real time via continuous runtime analytics; it uses AIX resource sets (rsets) for memory affinity placement (bos.aso fileset) and is started via the System Resource Controller (SRC).
– http://pic.dhe.ibm.com/infocenter/aix/v7r1/topic/com.ibm.aix.optimize/optimize_kickoff.htm
§ Exclusive use processor resource sets (XRSETs) – –
Exclusive use processor resource sets (XRSETs) allow administrators to guarantee resources for important work. An XRSET is a named resource set that changes the behavior of all the CPUs that it includes. Once a CPU is exclusive, it only runs programs explicitly directed to it. http://pic.dhe.ibm.com/infocenter/aix/v7r1/topic/com.ibm.aix.baseadmn/doc/baseadmndita/excluseprocrecset.htm
§ Use environment variable to specify process memory placement –
At the process level, the placement of user memory can be specified with the MEMORY_AFFINITY environment variable, such as:
• export MEMORY_AFFINITY=MCM
• export MEMORY_AFFINITY=MCM@SHM=RR
– http://pic.dhe.ibm.com/infocenter/aix/v7r1/topic/com.ibm.aix.prftungd/doc/prftungd/aix_mem_aff_support.htm
– NOTE: The vmo -o enhanced_affinity_private=100 is an advisory option with a similar effect as MEMORY_AFFINITY=MCM
§ Memory allocators –
AIX provides different memory allocators; each of them uses a different memory management algorithm and data structures. These allocators work independently and the application developer needs to choose one of them by exporting the MALLOCTYPE environment variable (see also MALLOCOPTIONS).
• Default allocator (Yorktown) – Maintains consistent performance, even in the worst case scenario, but might not be as memory efficient as the Watson allocator. Can be ideal for 32-bit applications which do not make frequent calls to malloc().
• Watson allocator – Is selected by setting MALLOCTYPE=watson and is specifically designed for 64-bit applications; it is memory efficient, scalable and provides good performance.
•
http://pic.dhe.ibm.com/infocenter/aix/v7r1/topic/com.ibm.aix.genprogc/doc/genprogc/sys_mem_alloc.htm
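A hedged usage sketch (MALLOCTYPE/MALLOCOPTIONS are standard AIX environment variables; the application name and the multiheap option value are illustrative only):
MALLOCTYPE=watson ./app64              # run a 64-bit application with the Watson allocator
MALLOCOPTIONS=multiheap:4 ./app64      # default (Yorktown) allocator with multiple heaps, an example option for threaded workloads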
AIX process memory placement using resource sets (rsets)
§ Using AIX resource sets (rsets) to limit process processor:memory placement
– Place an application on a limited number of CPUs and corresponding memory
1. Check the partition's view of memory:core distribution
   § lssrad -av
     REF1   SRAD        MEM       CPU
     0         0   31852.19       0-7
               1   47808.00       8-19
               2   80178.00       20-31
               3   95118.00       32-63
     1         4   72957.00       64-87
               5   22659.00       88-91
               6  111552.00       92-119
               7   47808.00       120-127
NOTE: The mapping of logical processors to cores and chips will be altered by changing SMT mode.
2. Create one rset for each REF1
   § mkrset -c 0-7 ref0/srad0
3. Start a process on a specific rset
   § execrset -c 0-7 -m 0 -e program123   or   execrset sys/node.04.00000 -e program123
4. Attach a process to each rset, separating them to either book
   § attachrset ref0/srad0 100001
5. Verify rset allocation and memory access pattern (high level)
   § lsrset -av, lsrset -vp <PID> and svmon -P <PID> -O threadaffinity=on
   § topas -M and mpstat -d
http://pic.dhe.ibm.com/infocenter/aix/v6r1/topic/com.ibm.aix.cmds/doc/aixcmds3/lsrset.htm
http://pic.dhe.ibm.com/infocenter/aix/v6r1/topic/com.ibm.aix.cmds/doc/aixcmds3/mkrset.htm
http://pic.dhe.ibm.com/infocenter/aix/v6r1/topic/com.ibm.aix.cmds/doc/aixcmds1/attachrset.htm
Viewing hypervisor call statistics with lparstat -H
§ lparstat -H displays the statistics for the hypervisor calls
– cede is when AIX gives virtual processor cycles back to the hypervisor.
– confer is when AIX lets the hypervisor know that a virtual processor is done processing on a thread and transfers (confers) its remaining capacity allowance to another virtual processor thread that has more work.
– prod awakens a virtual processor that has ceded its cycles.
– remove removes a Page Table Entry from the partition's node Page Frame Table. The calls clear_ref, protect, page_init, read and bulk_remove are also operations on the page table.
– xirr accepts a pending interrupt.
– pic returns the summation of the physical processor pool's idle cycles.
System configuration: type=Dedicated mode=Capped smt=4 lcpu=1024 mem=2019328MB
Detailed information on Hypervisor Calls
Hypervisor Call      Number of Calls   %Total Time   %Hypervisor      Avg Call      Max Call
                                             Spent    Time Spent      Time(ns)      Time(ns)
remove                     789009737           0.0           0.0         41492      87246812
read                        90954539           0.0           0.0           333       3692468
nclear_mod                         0           0.0           0.0             0             0
page_init                 1255350116           0.0           0.0          2206      29470281
clear_ref                     397277           0.0           0.0         34561      23156859
protect                      2337419           0.0           0.0         28631      30006296
put_tce                    716106216           0.0           0.0           509       7284875
xirr                     36226176062           0.1           0.1          1364      32242875
eoi                      36225386179           0.0           0.0           395      34981750
ipi                      72062062473           0.1           0.1           938      20433671
cppr                       194600229           0.0           0.0           373      28462812
asr                                0           0.0           0.0             0             0
others                         14164           0.0           0.0           908        252406
enter                      790139579           0.0           0.0           522       3251031
cede                     75659555665          90.7          99.7       1088484   98635959281
migrate_dma                        0           0.0           0.0             0             0
put_rtce                           0           0.0           0.0             0             0
confer                   36059007012           0.0           0.0            73       9777328
prod                     36080943570           0.1           0.1          2876      31315843
get_ppp                      1385970           0.0           0.0          3551        496218
set_ppp                            0           0.0           0.0             0             0
purr                               0           0.0           0.0             0             0
pic                          1385970           0.0           0.0           478        191343
bulk_remove                        0           0.0           0.0             0             0
send_crq                           0           0.0           0.0             0             0
copy_rdma                          0           0.0           0.0             0             0
get_tce                            0           0.0           0.0             0             0
send_logical_lan                   0           0.0           0.0             0             0
add_logicl_lan_buf                 0           0.0           0.0             0             0
Reference: Power Architecture Platform Reference (PAPR) http://openpowerfoundation.org/ https://www.power.org/
E870/E880 Hypervisor reserved memory
§ Disable "I/O Adapter Enlarged Capacity" to free up hypervisor memory on E870/E880
– AIX currently does not exploit this feature, so there is minimal benefit in enabling enlarged capacity
– Linux @ http://www-01.ibm.com/support/knowledgecenter/linuxonibm/liabm/liabmconcepts.htm
– This is a disruptive change
– Please refer to the technote
1. On the HMC, go to the server ASM interface and log in.
2. Disable I/O Adapter Enlarged Capacity by clearing the "Enable I/O Adapter Enlarged Capacity" selection box.
3. Power cycle the server (reboot).
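A hedged way to see how much memory the hypervisor is holding before and after the change (lshwres field names assumed from the HMC memory listing; verify against your HMC level):
lshwres -r mem -m <managed-system> --level sys -F configurable_sys_mem,curr_avail_sys_mem,sys_firmware_mem    # values in MB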
Network I/O
Virtual Ethernet statistics (netstat/entstat) 1/2 vSWITCH <> VEN(ent4) <> SEA(ent8) <> LAGG(ent7) <> PORT(ent0,ent1,ent2,ent3) <> phyNet Device Type: Virtual I/O Ethernet Adapter (l-lan) Elapsed Time: 181 days 3 hours 14 minutes 31 seconds … Transmit Statistics: Receive Statistics: -------------------------------------Packets: 238892976799 Packets: 271043439261 Bytes: 167824988837481 Bytes: 318553513160809 Interrupts: 0 Interrupts: 102727935192 Transmit Errors: 35263985 Receive Errors: 0 Packets Dropped: 35263984 Packets Dropped: 2830194 … No Carrier Sense: 0 CRC Errors: 0 DMA Underrun: 0 DMA Overrun: 0 Lost CTS Errors: 0 Alignment Errors: 0 Max Collision Errors: 0 No Resource Errors: 2830194 … General Statistics: ------------------No mbuf Errors: 0 Adapter Reset Count: 0 Adapter Data Rate: 20000 Driver Flags: Up Broadcast Running Simplex 64BitSupport ChecksumOffload … Virtual I/O Ethernet Adapter (l-lan) Specific Statistics: --------------------------------------------------------… Hypervisor Send Failures: 678960357 Receiver Failures: 678960357 Send Errors: 0 Hypervisor Receive Failures: 2830194 … Receive Information Receive Buffers Buffer Type Tiny Small Medium Large Huge Min Buffers 512 512 128 24 24 Max Buffers 2048 2048 256 64 64 Allocated 512 512 128 24 24 Registered 512 512 128 24 24 History Max Allocated 706 2048 128 24 24 Lowest Registered 502 209 128 24 24 Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
§
The Hypervisor increments the “Hypervisor Send Failures” counter every time it cannot send a packet due to a virtual Ethernet adapter (VEN) buffer shortage.
§
It also increments either the “Receiver Failure” or the “Send Errors” counter depending on where the buffer shortage occurred.
§
–
The “Receiver Failure” gets incremented in the case the partition to which the packet should be sent had no buffer available to receive the data.
–
The “Send Errors” gets incremented in the case that the sending partition is short on buffers.
The Hypervisor always increments the failure counters on both partitions if the data couldn’t be received due to a buffer shortage on the target partition.
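A quick hedged spot check for these counters from the partition or VIOS side (entstat as documented for AIX; ent4 is an example virtual adapter name):
entstat -d ent4 | egrep -i "hypervisor send failures|hypervisor receive failures|receiver failures|send errors|no resource errors"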
Virtual Ethernet statistics (netstat/entstat) 2/2 vSWITCH <> VEN(ent4) <> SEA(ent8) <> LAGG(ent7) <> PORT(ent0,ent1,ent2,ent3) <> phyNet Device Type: Virtual I/O Ethernet Adapter (l-lan) Elapsed Time: 181 days 3 hours 14 minutes 31 seconds … Transmit Statistics: Receive Statistics: -------------------------------------Packets: 238892976799 Packets: 271043439261 Bytes: 167824988837481 Bytes: 318553513160809 Interrupts: 0 Interrupts: 102727935192 Transmit Errors: 35263985 Receive Errors: 0 Packets Dropped: 35263984 Packets Dropped: 2830194 … No Carrier Sense: 0 CRC Errors: 0 DMA Underrun: 0 DMA Overrun: 0 Lost CTS Errors: 0 Alignment Errors: 0 Max Collision Errors: 0 No Resource Errors: 2830194 … General Statistics: ------------------No mbuf Errors: 0 Adapter Reset Count: 0 Adapter Data Rate: 20000 Driver Flags: Up Broadcast Running Simplex 64BitSupport ChecksumOffload … Virtual I/O Ethernet Adapter (l-lan) Specific Statistics: --------------------------------------------------------… Hypervisor Send Failures: 678960357 Receiver Failures: 678960357 Send Errors: 0 Hypervisor Receive Failures: 2830194 … Receive Information Receive Buffers Buffer Type Tiny Small Medium Large Huge Min Buffers 512 512 128 24 24 Max Buffers 2048 2048 256 64 64 Allocated 512 512 128 24 24 Registered 512 512 128 24 24 History Max Allocated 706 2048 128 24 24 Lowest Registered 502 209 128 24 24 Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
§
The Hypervisor increments the “Hypervisor Receive Failures” counter every time it cannot deliver a packet to the partition when the partition has virtual Ethernet adapter (VEN) buffer shortage.
§
Increase the amount of preallocated buffers.
§
Performance is much better when buffers are preallocated, rather than allocated dynamically when needed
Tracing Shared Ethernet Adapter (SEA)
§ Tracing Shared Ethernet Adapter (SEA) – 48F
• trace -aj48F; sleep 10; trcstop; trcrpt -o trace48F.out
... ID
ELAPSED_SEC
DELTA_MSEC
APPL
SYSCALL KERNEL
48F 0.000000000 0.000000 packets_queued=562949953421312 thread_quered=0 48F 0.000003048 0.003048 seat=F1000A00000F5C00 seap=0FFFFFFFF402FDE0 48F 0.000003232 0.000184 sea=F1000E000CAB8E00 48F 0.000003712 0.000480 vtype=0000000000008100 vid=92 48F 0.000003869 0.000157 48F 0.000012753 0.008884 mbuf=F1000E000CAB8E00 48F 0.000013000 0.000247 48F 0.000013154 0.000154 seat=F1000A00000F5C00 seap=0FFFFFFFF402FDE0 48F 0.000013562 0.000408 packets_queued=0 thread_anchor=F1000A00000F5C18 48F 0.000090371 0.076809 mbuf=F1000E0001A57000 flags=0000000000000000 ...
INTERRUPT
EXAMPLE vlan 0 71 72 73 74 88 89 91 92 93
sri_pktinfo 2 30 2 113 11 33 11 11 1101 238
SEA 02 sea_thread_pkt_count acs=0353F1000A003659 thread_index=0 SEA sea_send_packet_in acs=F1000A0036590000 thread_index=2 SEA sea_real_input ndd=F1000A0036590000 mbuf=F1000A003196C068 SEA sri_pktinfo sea=F1000A0036590000 mbuf=F1000E000CAB8E00 SEA seaha_check_vid_in acs=F1000A0036590000 vid=92 SEA sri_bridged sea=F1000A0036590000 outdev=F1000A00311C26E8 SEA sea_real_input out rc=0 SEA sea_send_packet_out acs=F1000A0036590000 thread_index=2 SEA sea_thread_sleeping acs=F1000A0036590000 thread_index=2 SEA sea_input_in acs=F1000A0036590000 nddp=F1000A00311C26E8
sri_pktinfo -> packets that are received from the external switch, vid VLAN ID svi_pktinfo -> packets that are sent to the external switch, vid is VLAN ID send_rarp -> RARP packet sent to the external switch
EXAMPLE
awk '/sri_pktinfo.*vid=/{i=substr($9,index($9,"=")+1);sri[i]++} END{printf "%-4.4s\t%s\n","vlan","sri_pktinfo";for (k in sri){printf "%-4.4s\t%d\n",k,sri[k]}}' trace48F.out
Gigabit Ethernet & VIOS SEA considerations
§ Tuning
– For optimum performance ensure adapter placement according to the Adapter Placement Guide
– Size the VIOS to fit the expected workload, such as:
  • For shared uncap weight=255, EC=2.0, VP=4
  • For dedicated VP=2+ and share_idle_procs_always
  • 4-8GB memory, partition placed within one domain
  • Pre-allocate max number of buffers
– Tuning applies at each layer:
  • On each physical adapter in the VIOS (ent)
  • On the Etherchannel in the VIOS (ent)
  • On the SEA in the VIOS (ent)
  • On the virtual Ethernet adapter in the VIOS (ent)
  • On the virtual Ethernet adapter in the AIX LPAR (ent)
  • On the virtual network interface in the AIX LPAR (en)
NOTE: In some network environments, network and virtualization stacks, and protocol endpoint devices, other settings might apply. ► LRO is only supported by AIX LPARs. ► LRO is not supported for 1Gbps adapters.
[Figure: network path from the AIX partition (enX / entX, VEN) through the Power Hypervisor vSwitch (QoS) to the Virtual I/O Server (entX VEN -> SEA -> LAGG -> PORT), then adapter placement, network switch and network routing.]
Gigabit Ethernet & VIOS SEA considerations
§ Each physical adapter in the VIOS (ent)
– chksum_offload enabled (default)
– flow_ctrl enabled (default)
– large_send enabled (preferred)
– large_receive enabled (preferred)
– jumbo_frames enabled (optional)
– Verify Adapter Data Rate for each physical adapter (entstat -d / netstat -v)
§ Each switch port
– Verify that flow control is enabled on the LAGG adapter ports, but on the switch ports only for accepting from (not sending to) the LAGG
• If you notice high values of Flow Control (XON/XOFF) received from the Ethernet network switch on some adapter ports, investigate the network switch load and which end-points (hosts) send them. As a quick fix, if XOFF is throttling the SEA for multiple partitions while the root cause is being investigated, disable the switch port sending XOFF to the LAGG adapter ports.
– Verify that Rapid Spanning Tree Protocol (IEEE 802.1w) is enabled • RSTP (IEEE 802.1w/802.1D-2004) can achieve much faster convergence in a properly configured network than STP, sometimes in the order of a few hundred milliseconds. • If RSTP is not available, evaluate vendor specific options, such as portfast option allows the switch to immediately forward packets on the port without first completing the Spanning Tree Protocol (which blocks the port until it is finished).
Gigabit Ethernet & VIOS SEA considerations
§ Link Aggregation in the VIOS (ent)
– Load Balance mode (let the secondary VIOS act as NIB)
– hash_mode set to src_dst_port (preferred)
– mode set to 8023ad (preferred)
– use_jumbo_frame enabled (optional)
Unbalanced outgoing transmit (entstat / netstat -v):
LAGG    Real    Transmit       Receive        Transmit %   Receive %
ent8    ent0    1,282,584      18,485,385     8.3%         35.2%
        ent1    12,579,150     32,382,013     81.7%        61.6%
        ent4    138,345        1,368,455      0.9%         2.6%
        ent5    1,390,994      313,396        9.0%         0.6%
hash_mode determines how the outgoing adapter is chosen
– Default uses only the IP address.
– To improve transmit balance, set hash_mode on the link aggregation to src_dst_port
– The outgoing adapter path (Transmit) is selected by an algorithm using the combined source and destination TCP or UDP port values.
– Since each connection has a unique TCP or UDP port, the three port-based hash modes provide additional adapter distribution flexibility when there are several separate TCP or UDP connections between an IP address pair.
– To improve receive balance, consider deploying network-based load balancing.
– http://publib.boulder.ibm.com/infocenter/aix/v7r1/topic/com.ibm.aix.commadmn/doc/commadmndita/etherchannel_loadbalance.htm
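A hedged example of applying these attributes to an existing EtherChannel device on the VIOS (ent8 is an example device name; the change takes effect at the next reconfiguration/reboot):
chdev -dev ent8 -attr hash_mode=src_dst_port mode=8023ad -perm     # VIOS padmin syntax
chdev -l ent8 -a hash_mode=src_dst_port -a mode=8023ad -P          # equivalent AIX (oem_setup_env) syntax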
Link Aggregation in the VIOS (ent)
Balanced transmit, investigate network load balance

VIOS1 configuration:
1 ent9 adapter_names ent1,ent2,ent3
1 ent9 backup_adapter NONE
1 ent9 hash_mode src_dst_port
1 ent9 mode 8023ad
1 ent9 netaddr 0

VIOS1 balance:
Lagg   Real   Transmit        Receive         Transmit %   Receive %
ent9   ent1   79214180124     12147461234     33.8%        6.3%
       ent2   78379412655     182178907801    33.4%        93.7%
       ent3   76749640578     526645          32.8%        0.0%

VIOS2 configuration:
2 ent10 adapter_names ent1,ent2,ent3
2 ent10 backup_adapter NONE
2 ent10 hash_mode src_dst_port
2 ent10 mode 8023ad
2 ent10 netaddr 0

VIOS2 balance:
Lagg    Real   Transmit   Receive   Transmit %   Receive %
ent10   ent1   5          8         33.3%        11.6%
        ent2   5          53        33.3%        76.8%
        ent3   5          8         33.3%        11.6%

ERROR example:
SEA ent11 State: LIMBO
LAGG ent10 Driver Flags: Limbo
Adapter ent1-3 Link Status: UNKNOWN
Media Speed Running: Unknown
IEEE 802.3ad Partner: OUT_OF_SYNC
Gigabit Ethernet & VIOS SEA considerations
§ SEA in the VIOS (ent)
Verify different VEN Trunk priority for each of the dual VIOS PVID large_receive enabled (preferred) largesend enabled (preferred) – ON/OFF for the virtualization stack jumbo_frames enabled (optional) netaddr set for primary VIOS (preferred for SEA w/failover) • Use base VLAN (tag 0) to ping external network address (beyond local) • Avoid using a router virtual IP address to ping (if its response time might fluctuate)
– Consider disabling SEA thread mode for SEA only VIOS – Share load for multiple IEEE VLAN virtual Ethernet trunk adapters between VIOS, either by manually setting alternating trunk priority for respective VIOS adapters, or leverage Load Sharing by SEA • http://pic.dhe.ibm.com/infocenter/powersys/v3r1m5/index.jsp?topic=/p7hb1/iphb1_vios_scenario_sea_load_shari ng.htm • http://www-01.ibm.com/support/docview.wss?uid=isg3T7000527 SEA ha_mode – Both primary and secondary VIOS are at version 2.2.1.0 or later – Two or more trunk adapters are configured for the primary and secondary SEA pair – The IEEE VLAN definitions of the trunk adapters must be identical for the primary and backup – When enabled, primary and the backup SEA negotiate the set of virtual local area network (VLAN) IDs they will bridge, and after successful negotiation, each SEA bridges the assigned trunk adapters and the associated IEEE VLANs – Configure as for SEA with failover (ha_mode=auto), if a failure occurs while load sharing, the active SEA bridges all trunk adapters and the associated IEEE VLANs – Test and verify by swap-over (ha_mode=standby) – Enable load sharing (ha_mode=sharing) Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
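A hedged sketch of the ha_mode changes described above, once the prerequisites are met (VIOS padmin syntax; ent8 is an example SEA device name):
chdev -dev ent8 -attr ha_mode=auto        # baseline: SEA with failover
chdev -dev ent8 -attr ha_mode=standby     # test swap-over to the backup SEA
chdev -dev ent8 -attr ha_mode=sharing     # enable load sharing
entstat -all ent8 | grep -i state         # verify the SEA state afterwards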
Gigabit Ethernet & VIOS SEA considerations
§ Virtual Ethernet adapter in the VIOS (ent)
– chksum_offload enabled (default)
– In high load conditions, the VEN buffer pool management of adding and reducing the buffer pools on demand can introduce latency in handling packets (and can result in dropped packets, "Hypervisor Receive Failures").
– Setting the "Min Buffers" to the same value as the "Max Buffers" allowed will eliminate the adding and reducing of the buffer pools on demand, but it will use more pinned memory; size the VIOS memory accordingly.
– For a VIOS with high Virtual Ethernet buffer utilization, set max to the maximum allowed, and min to 50-100% of max:
• chdev -l ent# -a min_buf_tiny=4096 -a max_buf_tiny=4096 -a min_buf_small=4096 -a max_buf_small=4096 -a min_buf_medium=2048 -a max_buf_medium=2048 -a min_buf_large=256 -a max_buf_large=256 -P
Virtual Adapter Buffer Information
Buffer     Max Size    Buffers
Tiny            512       4096
Small          2048       4096
Medium        16384       2048
Large         32768        256
Huge          65536        128
Hypervisor Send Failures (entstat/netstat) – The Hypervisor increments the “Hypervisor Send Failures” counter every time it cannot send a packet due to a virtual Ethernet adapter (VEN) buffer shortage. It also increments either the “Receiver Failure” or the “Send Errors” counter depending on where the buffer shortage occurred. – The “Receiver Failure” gets incremented in the case the partition to which the packet should be sent had no buffer available to receive the data, and the Hypervisor can not deliver the data. – The “Send Error” gets incremented in the case that the sending partition is short on buffers. – The Hypervisor always increments the failure counters on both partitions if the data couldn’t be received due to a buffer shortage on the target partition. Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
Gigabit Ethernet & VIOS SEA considerations § Virtual Ethernet adapter in the virtual client/partition (ent) – chksum_offload enabled (default) – In high load conditions, the VEN buffer pool management of adding and reducing the buffer pools on demand can introduce latency of handling packets (and can result in drop of packets, “Hypervisor Receive Failures”). – Monitor utilization with enstat -d or netstat -v • If “Max Allocated” is higher than “Min Buffers”, increase to higher value than “Max Allocated” or to “Max Buffers”, e.g: • Increase the "Min Buffers“ to be greater than "Max Allocated" by increasing it up to the next multiple of 256 for "Tiny" and "Small" buffers, by the next multiple of 128 for "Medium" buffers, by the next multiple of 16 for "Large“ buffers, and by the next multiple of 8 for "Huge" buffers. • Or set max to max allowed max, and min to 50-100% of max, such as: • chdev -l ent# -a min_buf_tiny=4096 -a max_buf_tiny=4096 -a min_buf_small=4096 -a max_buf_small=4096 -a min_buf_medium=2048 -a max_buf_medium=2048 -a min_buf_large=256 -a max_buf_large=256 -P Hypervisor Receive Failures (entstat/netstat) – The Hypervisor increments the “Hypervisor Receive Failures” counter every time it cannot deliver a packet due to a virtual Ethernet adapter (VEN) buffer shortage on the local partition. It will also show up under “Receive Statistics” as “Packets Dropped” and “No Resource Errors”. Example from netstat -v ETHERNET STATISTICS (entX): Receive Buffers Buffer Type Min Buffers Max Buffers Max Allocated
Buffer Type      Tiny    Small    Medium    Large    Huge
Min Buffers       512      512       128       24      24
Max Buffers      2048     2048       256       64      64
Max Allocated     512     1267       128       24      24
Set min_buf_small to 1523 or 2048
Gigabit Ethernet & VIOS SEA considerations § Virtual network interface in the virtual client/partition (en) – mtu_bypass enabled • Is the largesend attribute for virtual Ethernet from AIX 6100-07-01 7100-00-01 • If not available set with the ifconfig command, e.g: ifconfig enX largesend
– Use the device driver built-in interface specific network options (ISNO) • • • •
ISNO is enabled by default (the restricted no tunable use_isno) Device drivers have default settings, advised for most workloads Check current settings with ifconfig command, change with chdev command Can override with ifconfig command or setsockopt() options
– Set mtu to 9000 if using jumbo frames (network support required) • For streaming workload only, not small request-response
– Consider enabling network interface thread mode (dog thread) • http://pic.dhe.ibm.com/infocenter/aix/v7r1/index.jsp?topic=/com.ibm.aix.prftungd/doc/prftungd/enable_thread_usa ge_lan_adapters.htm NOTE on network interface thread mode – On an SMP system, a single CPU can become the bottleneck for receiving packets from a fast adapter. By enabling the dog threads feature, the driver queues the incoming packet to the thread which then handles the network stack. – Enabling the dog threads feature can increase capacity of the system in some cases, where the incoming packet rate is high, allowing incoming packets to be processed in parallel by multiple CPUs. – Set with ifconfig command: ifconfig enX thread – Unset with ifconfig command: ifconfig enX -thread – Check with ifconfig command (look for “THREAD”): ifconfig enX – Check utilization with netstat command: netstat -s| grep -i thread – With the large number of hardware threads available, the incoming threads can be spread (hashed) out to too many dog threads and that can limit the performance gains if locking issues occur. Use the no command to limit, such as: – no -o ndogthreads=2 (the -r option enables after reboot). Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
Additional Virtual and Physical Ethernet adapter tuning § Virtual Ethernet performance enhancing ODM attribute "dcbflush_local“ – For the virtual Ethernet adapter (entN) on POWER7 systems only, check availability with: • odmget PdAt |grep -p dcbf
– AIX 6.1 and AIX 7.1 fix references: • https://www-304.ibm.com/support/docview.wss?uid=isg1IZ84165 • dcbflush_local() routine to flush cache
– Can be enabled on the virtual Ethernet adapters: • chdev -l entN -a dcbflush_local=yes
– Enable it both on the VIOC and VIOS for the virtual Ethernet adapters with the same PVID
– Setting this tunable can in some workloads improve the performance of the virtual Ethernet adapter by up to 15% in benchmark conditions, especially for partition-to-partition traffic where the partitions are placed on different affinity domains.
§ Adapter No Resource Errors (entstat command statistic)
– The number of incoming packets dropped by the hardware due to lack of resources.
– This error usually occurs because the receive buffers on the adapter were exhausted.
– Increase the adapter's receive buffer sizes, e.g. for 1Gbps adapters by adjusting the "receive descriptor queue size" (rxdesc_que_sz) and "receive buffer pool size" (rxbuf_pool_sz); this requires deactivating/activating the adapter.
– Consider doubling rxdesc_que_sz and setting rxbuf_pool_sz to two (2) times the value of rxdesc_que_sz with the chdev command, e.g: chdev -Pl entX -a rxdesc_que_sz=4096 -a rxbuf_pool_sz=8192
– http://pic.dhe.ibm.com/infocenter/aix/v7r1/topic/com.ibm.aix.prftungd/doc/prftungd/adapter_stats.htm
Ethernet adapter network interface threading § Network interface threading (dog thread) – On an SMP system, a single CPU can become the bottleneck for receiving packets from a fast adapter. By enabling the dog threads feature, the driver queues the incoming packet to the thread and the thread handles calling IP, TCP, and the socket code. – The thread can run on other CPUs which might be idle. – Enabling the dog threads can increase capacity of the system in some cases, where the incoming packet rate is high, allowing incoming packets to be processed in parallel by multiple CPUs. • • • •
Set with ifconfig command: ifconfig enX thread Unset with ifconfig command: ifconfig enX -thread Check with ifconfig command (look for “THREAD”): ifconfig enX Check utilization with netstat command: netstat -s| grep hread
– NOTE: With the large number of CPU hardware threads, the incoming packet workload can be spread (hashed) out to too many dog threads and that can limit the performance gains. For example on a 32 thread CPU LPAR, consider limiting the number of dog threads with the no command, such as to 2 (instead of default 32): • no -o ndogthreads=2 (the -r option enables after reboot).
http://pic.dhe.ibm.com/infocenter/aix/v7r1/index.jsp?topic=/com.ibm.aix.prftungd/doc/prftungd/enable_thread_usage_lan_adapters.htm
http://pic.dhe.ibm.com/infocenter/aix/v6r1/topic/com.ibm.aix.prftungd/doc/prftungd/tcp_udp_perf_tuning.htm
http://pic.dhe.ibm.com/infocenter/aix/v7r1/topic/com.ibm.aix.prftungd/doc/prftungd/interrupt_coal.htm
PowerSC Trusted Firewall intervlan routing performance feature § The Trusted Firewall feature provide intervirtual LAN routing functions – By using the Shared Ethernet Adapter (SEA) and the Security Virtual Machine (SVM) kernel extension to enable the communication.
§ Provides virtualization-layer security – That improves performance and resource efficiency when communicating between different virtual LAN (VLAN) security zones on the same Power Systems server.
§ Configurable firewall within PowerVM virtualization layer of Power Systems – In this example, the goal is to be able to transfer information securely and efficiently from LPAR1 on VLAN 200 and from LPAR2 on VLAN 100.
§ Prerequisites – Trusted Firewall require Virtual I/O Server 2.2.1.4, or later with fileset powerscStd.svm Secure Virtual Machine (SVM) Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
Figure 2. Example of cross-VLAN information transfer with Trusted Firewall http://pic.dhe.ibm.com/infocenter/aix/v7r1/topic/com.ibm.aix.powersc/tfw503.gif
Using PowerSC Trusted Firewall intervlan feature § On VIOS as padmin – Initialize the Secure Virtual Machine (SVM) driver: • mksvm
– Check status of SVM (capability=0): • vlantfw -q
– Start SVM: • vlantfw -s
– Check status of SVM (capability=0): • vlantfw -q
– Display all known LPAR IP and MAC addresses: • vlantfw –d
– Create the filter rule to allow communication between the two LPARs:
• Basic syntax: genvfilt -v4 -a P -z [lpar1vlanid] -Z [lpar2vlanid] -s [lpar1ipaddress] -d [lpar2ipaddress]
• To allow all IPv4 traffic between two LPARs on VLAN 123 and 321:
• genvfilt -v4 -a P -z 123 -Z 321 -s 172.28.1.101 -d 172.28.2.202
– Activate filter rules: • mkvfilt -v4 -u
– Display all active filter rules: • lsvfilt -a
– Verify inter-VLAN communication using Secure Virtual Machine (SVM)
  • On the VIOS spot check with: netstat -v and seastat -d ent8 -s vlan=123 -a vlan=321
  • On the AIX LPARs spot check with: netstat -v, tcpdump -i en0 host 172.28.2.202, or iptrace
– Stop SVM:
  • vlantfw -t
– To deactivate filter rules:
  • All defined filter rules: rmvfilt -n all
  • A specific filter rule #: rmvfilt -n #
VIOS VLAN Ethernet adapter on the SEA (1/2) VLAN(en6) v vSWITCH <> VEN(ent3&ent7) <> SEA(ent5) <> LAGG(ent2) <> PORT(ent0&ent1) <> phyNet ETHERNET STATISTICS (ent5) : Device Type: Shared Ethernet Adapter Network interface device (en6) Transmit Errors: 8708482 mtu 1500 Maximum IP Packet Size for This Device True 8708484 mtu_bypass off Enable/Disable largesend for Packets virtualDropped: Ethernet True
netaddr netmask
10.6.80.67 Internet Address 255.255.255.192 Subnet Mask
VLAN Network adapter (ent6)
base_adapter ent5 VLAN Base Adapter True vlan_priority 0 VLAN Priority True vlan_tag_id 1600 VLAN Tag ID True
SEA Network adapter (ent5) accounting ctl_chan ha_mode large_receive largesend netaddr pvid pvid_adapter real_adapter thread virt_adapters
disabled ent4 auto no 0 0 2 ent3 ent2 1 ent3,ent7
Etherchannel adapter (ent2) adapter_names hash_mode mode netaddr
8708482 = 5658828 + 3049654
True
ETHERNET STATISTICS True (ent6) : Device Type: Transmit Errors: 8708482 Packets Dropped: 8708484 ETHERNET STATISTICS (ent3) :
Device statistics Type: Virtual I/O Ethernet Adapter Enable per-client accounting of network True (l-lan) Transmit Errors: 5658828 Control Channel adapter for SEA failover True Packets Dropped: 5658830 High Availability Mode True Virtual I/O Ethernet Adapter (l-lan) Specific Statistics: Enable receive TCP segment aggregation True Hypervisor Send Failures: 5403716 Enable Hardware Transmit TCP Resegmentation True Receiver Failures: 5403716 Address to ping True Receive Information PVID to use for the SEA device True Receive Buffers Default virtual adapter to use for non-VLAN-tagged packets Tiny True Buffer Type Small Medium Large Physical adapter associated with the SEA True Min Buffers 512 512 128 24 Thread mode enabled (1) or disabled (0) True Allocated 512 523 128 24 List of virtual adapters associated with the SEA (comma separated) True History
ent0,ent1 default 8023ad 0
Max Allocated
523
889
128
EtherChannel Adapters True Determines how outgoing adapter is chosen True: ETHERNET STATISTICS (ent7) EtherChannel mode of operation Device Type: Virtual I/OTrue Ethernet Adapter (l-lan) Transmit Errors: 3049654True Address to ping
ETHERNET STATISTICS (ent7) continued: Packets Dropped: 3049654
Receive Buffers (Tiny/Small/Medium/Large/Huge): Min Buffers 512 512 128 24 24, Allocated 512 520 128 24 24, Max Allocated 512 670 128 24 24
Note: Edited for readability from entstat/netstat and lsattr commands
It is supported to set the IP address for accessing the VIOS on the "interface (enX) associated with either the Shared Ethernet Adapter device or VLAN pseudo-device".
• http://pic.dhe.ibm.com/infocenter/powersys/v3r1m5/topic/p7hb1/iphb1_vios_configuring_sea.htm
• http://www-01.ibm.com/support/docview.wss?uid=isg3T1011897
For high workload performance reasons it is not recommended to set IP addresses and VLANs on top of the Shared Ethernet Adapter (SEA) for workload network traffic. It results in the SEA, which is acting as an L2 bridge, copying (bridging) network packets up the VIOS TCP/IP stack for the network interface, but also separately to the hypervisor switch. The SEA should only bridge, and the Power Hypervisor vSwitch should act as the Ethernet switch.
VIOS VLAN Ethernet adapter on the SEA
Create a separate Virtual Ethernet with the same PVID as for the SEA, and if desired VLAN tagged.
• The SEA should bridge the packets.
• The hypervisor should switch the packets.
If an IP address is set on the SEA, with or without VLAN, packets can be bridged to all paths.
• The VIOS network interface with the IP address will receive all traffic (w/wo VLAN).
Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
TCP SEND and RECV buffer space with ISNO enabled
§ When ISNO (Interface Specific Network Options) is enabled, the global settings have no impact
§ Order of using values:
1. Socket API settings (setsockopt())
2. Network interface (ifconfig enX attribute value)
3. Network interface (chdev -l enX -a attribute=value)
4. Network interface (default by device driver)
5. OS global settings (no command)
When changing at runtime, perform in two steps:
1. ifconfig enX tcp_sendspace <size>
2. chdev -l enX -a tcp_sendspace=<size> -P
NOTE: This will only affect new sockets, for open sockets application restart is usually required (to re-open the sockets).
ca001l01 en0: flags=1e080863,4c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACT IVE),LARGESEND,CHAIN> inet 172.28.15.143 netmask 0xffffff80 broadcast 172.28.15.255 tcp_sendspace 131072 tcp_recvspace 131072 rfc1323 1 en1: flags=1e080863,4c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACT IVE),LARGESEND,CHAIN> inet 172.18.190.85 netmask 0xffffff00 broadcast 172.18.190.255 This is used tcp_sendspace 262144 tcp_recvspace 262144 rfc1323 1
no -L output:
                 Current    Default
tcp_recvspace    653900     16K
tcp_sendspace    1277K      16K
This is NOT used (the global no values are ignored when ISNO is enabled)
http://pic.dhe.ibm.com/infocenter/aix/v6r1/index.jsp?topic=/com.ibm.aix.prftungd/doc/prftungd/interface_network_opts.htm
TCP LARGESEND § Benefits of Large Send Offload (LARGESEND): – The TCP large send offload option allows the AIX TCP layer to build a TCP segment up to 64 KB long. The adapter sends the segment in one call down the stack through IP and the Ethernet device driver. The adapter then breaks the message into multiple TCP frames to transmit data on the cable (MTU size) . To enable large send offload:
Missing LARGESEND ca001l17
• From AIX 6.1 TL7 SP1 or AIX7.1 SP1: • Verify LARGESEND is enabled on VIOS SEA, if not enable: • chdev -dev <SEA> -attr largesend=1 • Enable LARGESEND for each Network Interface • Immediate: ifconfig en0 largesend • For after reboot: chdev -l en0 -a mtu_bypass=on -P • Before AIX 6.1 TL7 SP1 or AIX7.1 SP1: • Set LARGESEND with the ifconfig command for each Network Interface, after partition boot in /etc/rc.net or equiv by init.
en1: flags=1e080863,480<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACT IVE),CHAIN> inet 172.18.190.92 netmask 0xffffff00 broadcast 172.18.190.255
ca001l20 en1: flags=1e080863,480<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACT IVE),CHAIN> inet 172.18.190.88 netmask 0xffffff00 broadcast 172.18.190.255
http://pic.dhe.ibm.com/infocenter/aix/v7r1/topic/com.ibm.aix.prftungd/doc/prftungd/tcp_large_send_offload.htm Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
93
Updated
tcp_fastlo for faster loopback §
Enabling tcp_fastlo with no command (tcp_fastlo=1) – The transmission control protocol (TCP) fastpath loopback option is used to achieve better performance for the loopback traffic. – tcp_fastlo network tunable parameter permits the TCP loopback traffic to reduce the path length for the entire TCP/IP stack (protocol and interface), and when enabled the TCP loopback traffic is handled similarly to the UNIX domain implementation. – The TCP fastpath loopback traffic is accounted for in separate statistics by the netstat command, when the TCP connection is open, it is not accounted to the loopback interface. •
netstat -s -p tcp | grep "fastpath loopback connections"
–
The TCP fastpath loopback does use the TCP/IP and loopback device to establish and terminate the fast path connections, therefore these packets are accounted for in the normal manner.
–
NOTE: This is for TCP only, with fragmented IP packets on loopback (lo0), for UDP you can increase the MTU size on lo0 from default 16K to 65415 which can reduce IP fragmentation (chdev -l lo0 -a mtu=65415)
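A hedged sketch of enabling and checking it (standard no/netstat/chdev usage; the -p option makes the change persistent, and only new connections use the fast path):
no -p -o tcp_fastlo=1                             # enable TCP fastpath loopback
netstat -s -p tcp | grep -i "fastpath loopback"   # check the fastpath loopback connection counters
chdev -l lo0 -a mtu=65415                         # optional, for UDP: larger lo0 MTU per the note above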
Some notes on netstat § Look for drops, discarded, retransmits, delay acks, out of order, fragments, errors § Such as: – IP (netstat -p ip) • number of IP “packets dropped due to the full socket receive buffer”
– UDP (netstat -p udp) • number of UDP “socket buffer overflows” • number of UDP “datagrams dropped due to no socket”
– TCP (netstat -p tcp) • • • •
number of TCP packets “discarded due to listener's queue full” received number of TCP “data packets retransmitted” of “data packets” sent number of TCP “out-of-order packets” of “data packets” received number of TCP data packets received “duplicate acks” of “acks”
§ Establish a base level of acceptable performance, such as:
– Keep UDP "socket buffer overflows" at zero
– Keep TCP "discarded due to listener's queue full" at zero
– Keep TCP retransmit percentage below 0.02%
Useful command options:
netstat -v
netstat -s
netstat -m
netstat -ss
netstat -p <PROTOCOL>
netstat -ano
TCP data packets percentage retransmitted SPOT CHECK Keep TCP data packet retransmit percentage below 0.02%, at this or higher level it is recommended to investigate and remedy cause to reduce the retransmit percentage below 0.02%. Consider the E2E network flow, including all intermediary virtual and physical networks.
The table values are calculated from IPL, using netstat -s output TCP section.
Example (edited for clarity) netstat -s ... tcp: ... 987... packets sent 234... data packets (... bytes) 12... data packets (... bytes) retransmitted
Partition   Data packets sent   Data packets retransmitted   Percentage retransmitted
lpar1       999046345           481684885                    48.214
lpar2       112616168           352597                       0.313
lpar3       5091167             14702                        0.289
lpar4       3080628             7931                         0.257
lpar5       545224113           989303                       0.181
lpar6       912702              1173                         0.129
lpar7       2801348             2880                         0.103
lpar8       32694736            32096                        0.098
lpar9       1064267301          821432                       0.077
lpar10      80118388            46839                        0.058
lpar11      1118939392          474951                       0.042
lpar12      854858762           247960                       0.029
lpar13      867407811           217364                       0.025
lpar14      341422              41                           0.012
lpar15      628408406           64887                        0.01
lpar16      262577841           20597                        0.008
lpar17      65215588            3377                         0.005
lpar18      63708083            3237                         0.005
lpar19      128210829           5005                         0.004
lpar20      41512979            1571                         0.004
lpar21      75350310718         5                            0
….
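A hedged one-liner to compute the same percentage on a single partition (assumes the netstat -s tcp wording shown in the example above):
netstat -s -p tcp | awk '
  /data packets \(/ && !/retransmitted/ { sent = $1 }   # "N data packets (M bytes)"
  /data packets \(/ &&  /retransmitted/ { retx = $1 }   # "N data packets (M bytes) retransmitted"
  END { if (sent > 0) printf "TCP data packet retransmit: %.3f%%\n", retx * 100 / sent }'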
TCP packets discarded due to listener's queue full § The per TCP socket outstanding connection request queue length limit is specified by the parameter backlog with the listen() call. § The no parameter - somaxconn - defines the maximum queue length limit allowed on the system, so the effective queue length limit will be either backlog or somaxconn, whichever is smaller. – no -o somaxconn – no -d somaxconn – no -po somaxconn=2048
// display current setting // reset to default // set to 2K from default 1K
SPOT CHECK
Partition   Packets discarded
lpar1       1055359
lpar2       1969626
lpar3       1640751
lpar5       32955
lpar6       1243
§ The listen subroutine performs the following activities: a) b) c)
Identifies the socket that receives the connections. Marks the socket as accepting connections. Limits the number of outstanding connection requests in the system queue.
The table values are calculated from IPL, using netstat -s output TCP section.
Example (edited for clarity) netstat -s ... tcp: ... ...
1325588 discarded due to listener's queue full
Network IP trace
§ Ensure a file system with at least 1GB free space is available (for 10-20min traffic)
§ Start IP trace, and preferably limit what is traced if possible, such as interface and protocol (and possibly source/destination hosts):
– startsrc -s iptrace -a "-i enX -P tcp nettrcf_raw"
§ Stop IP trace: – stopsrc -s iptrace
Example throughput graph illustrating a problem using GUI
§ Can create a text report: – ipreport -v nettrcf_raw > nettrcf_report
§ Can use the open source Wireshark GUI tool from – http://www.wireshark.org/download.html
§ Can use the open source Wireshark command line tool tshark, such as:
– tshark.exe -R "tcp.len>1448" -r nettrcf_raw
Example illustrating a problem using tshark
… 1005 122.895749299 10.1.1.13 -> 10.1.1.17 TCP 18890 50770 > 5001 [ACK] Seq=35742433 Ack=1 Win=32761 Len=18824 TSval=1335798940 TSecr=1334065961
1009 122.896252205 10.1.1.13 -> 10.1.1.17 TCP 23234 [TCP Previous segment lost] 50770 > 5001 [ACK] Seq=35956737 Ack=1 Win=32761 Len=23168 TSval=1335798940 TSecr=1334065961 …
Storage I/O
Fibre Channel and Storage I/O flow
Principle I/O flow:
1. Application issuing reads/writes.
2. File system receives the requests, allocates file system buffers dynamically.
3. File system passes the requests to the LVM layer for the pinned buffers.
4. LVM then identifies the appropriate DISK device driver.
5. The DISK driver then hands over the requests to the FC ADAPTER driver/VSCSI driver, which manages the FC adapter (HBA port) transmission.

I/O STACK and TUNABLES per layer:
– Database/Application: e.g. Oracle db_block_size
– Raw LVs / Raw disks
– File System (JFS2): j2_dynamicBufferPreallocation, agblksz & noatime, fsbufs & psbufs
– VMM
– LVM (dd): pbufs
– Multi-Path IO driver: AIX MPIO round_robin, shortest_queue, fail_over
– Disk Device Drivers: max_transfer & queue_depth
– VFC/NPIV and VSCSI: max_xfer_size & num_cmd_elems
– Fibre Channel Adapter Device Drivers: max_xfer_size & num_cmd_elems
– Fibre Channel Adapter: #ports / #adapters
– Storage Area Network Fabric: cables, GBICs, CRC/Tx-errors, port and interlink speeds, fillwords, buffer credits, slow draining devices, ...
– Disk Storage Systems

Please review the Storage vendor guidance, limitations and recommendations for the attribute values used in FC and HDISK device tuning. Setting I/O device tuning attribute values too high or incorrectly may have a negative impact on I/O performance due to overloading of the backing storage infrastructure, and can even result in FC frames being discarded, in the worst case leading to data corruption. For production workload, monitor utilization with the fcstat command for physical/virtual FC adapters, and the iostat/sar commands for virtual SCSI adapters.
Blocked I/Os: lack of free psbufs or fsbufs
If the VMM must wait for a free bufstruct, it puts the process on the VMM wait list before the start I/O is issued and will wake it up once a bufstruct has become available. The bufstructs are pinned memory buffers used to hold I/O requests.
Command: vmstat -v
EXAMPLE
paging space I/Os blocked with no psbuf                      6123513
client filesystem I/Os blocked with no fsbuf                 1153328
external pager filesystem I/Os blocked with no fsbuf       341076481
paging space I/Os blocked with no psbuf Number of paging space I/O requests blocked because no psbuf was available (virtual memory manager layer). Tuning: Increase the number of equal size paging devices Note: Review cause for paging space paging, and ensure sufficient real memory is available for applications and system.
client filesystem I/Os blocked with no fsbuf Number of client filesystem I/O (NFS) requests blocked because no fsbuf was available (file system layer) Tuning: From AIX 6100-02 VMM/NFS will adjust dynamically NFS memory pools and memory buffers used for NFS Paging Device Table (pdt) and the attributes nfs_v#_pdts and nfs_v#_vm_bufs are thereafter restricted.
external pager filesystem I/Os blocked with no fsbuf
Number of external pager client filesystem (JFS2) I/O requests blocked because no fsbuf was available (file system layer).
Tuning: Use the ioo command to increase the value of the j2_dynamicBufferPreallocation attribute. The value is in 16k slabs, per filesystem. The filesystem does not need remounting. Consider doubling the current value and monitor the effect before increasing it again.
• ioo -r -o j2_dynamicBufferPreallocation=32
Note: The vmstat -v command "filesystem I/Os blocked with no fsbuf" only displays statistics for JFS filesystems.
Blocked I/Os: lack of free pbufs per volume group I/O STACK Database/Application Raw LVs
Raw disks
If the VMM must wait for a free bufstructs, it puts the process on the VMM wait list before the start I/O is issued and will wake it up once a bufstruct has become available. The bufstructs are pinned memory buffers used to hold I/O requests.
File System VMM LVM (dd)
Multi-Path IO driver Disk Device Drivers Fibre Channel Adapter Device Drivers Fibre Channel Adapter Storage Area Network Fabric
Command: lvmo –a –v <vg>
Disk Storage Systems
EXAMPLE:
vgname   pervg_blocked_io_count
rootvg   109893
vg1      494473
vg2      34502
pervg_blocked_io_count Number of I/Os that were blocked due to lack of free pbufs for the volume group. Tuning: Increase incrementally in steps, and monitor overall system performance at each step. Consider doubling the current value and monitoring the effect before increasing it again. Change the per volume group pbufs (pv_pbuf_count) with the lvmo command, for example: • lvmo -v rootvg -a pv_pbuf_count=1024 Change the system global pbufs that apply to all volume groups on the system (pv_min_pbuf), which must be set before varyon of the volume groups, with the ioo command, for example: • ioo -p -o pv_min_pbuf=1024 Note: If both pv_pbuf_count and pv_min_pbuf are configured, the larger value takes precedence. Note: The vmstat -v counter "pending disk I/Os blocked with no pbuf" only displays statistics for the rootvg volume group on AIX 6.1, but the global count on AIX 5.3.
Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
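A minimal, hedged sketch to review the per volume group counter for every varied-on volume group before and after a change (lsvg -o lists the active volume groups):
    # Show pbuf setting and blocked I/O count for each active volume group
    for vg in $(lsvg -o); do
        echo "=== $vg ==="
        lvmo -v "$vg" -a | egrep 'pv_pbuf_count|pervg_blocked_io_count'
    done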
© Copyright IBM Corporation 2015
102
Updated
Blocked I/O: disk device driver service queues I/O STACK Database/Application
AIX 6.1/7.1 per second during interval
Raw LVs
Raw disks
If the disk device driver service queue (sqsz) is full (sqfull), the I/O request is put on a pending wait queue (wqsz) until the service queue has free slots. The disk device driver hands over the I/O request bufstructs in the service queue to the adapter device driver.
File System VMM LVM (dd)
Multi-Path IO driver Disk Device Drivers
AIX 5.3 accumulated since reset
Fibre Channel Adapter Device Drivers Fibre Channel Adapter Storage Area Network Fabric Disk Storage Systems
Command: iostat -DRTl | awk '$1~/hdisk/ && $24>0.0' Adjust disk device tunables: If avgsqsz is not zero (0), investigate whether the underlying layers are limiting and/or whether the storage system is fast enough to handle the service queue. If sqfull or avgwqsz is not zero (0), increase the disk device driver's queue_depth attribute value incrementally in steps. Disk device driver tunables: • max_transfer is the maximum transfer size for disk device driver I/O requests. Setting it to 0x100000 (1MB), from the default 0x40000 (256KB), can reduce the number of IOPS by up to 4 times, since some disk device drivers allow coalescing of smaller I/O requests into larger ones. Correlate with the adapter device driver max_xfer_size and the backing storage system. • queue_depth is the number of concurrent I/O requests the disk device driver can queue; if full, I/O service requests will be pending. The maximum value is 256. When increasing, a common starting point is to double the current value evenly for all disk devices over the same adapter. Correlate with the adapter device driver num_cmd_elems. Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
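Purely as an illustration (the hdisk and fscsi names are placeholders; -P defers the change until the device is reconfigured or the partition is rebooted), a sketch of reviewing and doubling queue_depth evenly for the disks behind one adapter:
    # Current settings for one disk
    lsattr -El hdisk27 -a queue_depth -a max_transfer
    # Double queue_depth for all hdisks whose parent is fscsi0 (deferred change)
    for d in $(lsdev -p fscsi0 -F name | grep hdisk); do
        cur=$(lsattr -El "$d" -a queue_depth -F value)
        chdev -l "$d" -a queue_depth=$((cur * 2)) -P
    done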
iostat command: sqfull is the number of times the service queue becomes full (per second during the interval on AIX 6.1/7.1, accumulated since reset on AIX 5.3). avgsqsz is the average disk service queue size. avgwqsz is the average disk wait queue size.
EXAMPLE (disk / sqfull per second): hdisk27 91.6, hdisk26 90.8, hdisk28 86.4, hdisk19 82.6, hdisk20 82, hdisk208 76.2, hdisk18 15.2, hdisk21 15, hdisk204 5.4, hdisk216 5.4, hdisk212 2.4, hdisk207 2.4, hdisk203 2.1, hdisk219 1.8, hdisk218 1.5, hdisk215 1.5, hdisk209 1.2, hdisk220 0.2
EXAMPLE (disk / sqfull accumulated since reset): hdisk88 47023174, hdisk94 11606256, hdisk93 10594224, hdisk92 7036146, hdisk82 6972949, hdisk91 6745470, hdisk83 4795208, hdisk95 4478127, hdisk89 2984135, hdisk75 2213792, hdisk87 2156686, hdisk100 2122713, hdisk84 2023983, hdisk73 1954585, hdisk115 1649790, hdisk105 1638587, hdisk116 1621747, hdisk104 1620446
EXAMPLE (disk / avg wqsz / avg sqsz): hdisk19 1 1, hdisk20 1 1, hdisk26 1 1, hdisk27 1 1, hdisk28 1 1
© Copyright IBM Corporation 2015
103
Updated
Blocked I/Os: FC adapter and FC device driver I/O STACK
Database/Application – File System – VMM – LVM (dd) – Multi-Path IO driver – Disk Device Drivers – Fibre Channel Adapter Device Drivers – Fibre Channel Adapter – Storage Area Network Fabric – Disk Storage Systems (Raw LVs / Raw disks)
If the adapter device driver service queue is full, the I/O request is put on a pending wait queue until the service queue has free slots. Command: fcstat
1. Resolve error/problem issues first, such as:
– Is "Port Speed (running)" at the expected speed (e.g. 8 Gbps)
– Frames Error or Dumped (relate to Seconds Since Last Reset)
– Loss of Sync or Signal (relate to Seconds Since Last Reset)
– Invalid Tx Word Count or CRC Count (relate to Seconds Since Last Reset)
2.
Adjust tunables to reduce resource constraints and mitigate depletion (fcstat command):
– No DMA Resource Count (device driver I/O) • Tuning: Increase the FC device driver's max_xfer_size attribute value to 0x200000 (2MB); this allows larger I/O transfers and, when set to 2MB, also increases the DMA address space available to the device driver. It should be higher than or equal to any disk device driver's max_transfer attribute value.
– No Command Resource Count (device driver queue) • Tuning: Increase the FC device driver's num_cmd_elems attribute value (correlate with the disk device drivers' queue_depth attribute values); this is the number of concurrent I/O requests the device driver can queue, and if it is full, I/O service requests will be pending. Max num_cmd_elems is 4096/3200 for physical FC (8/16Gbps); and 256 for VFC/NPIV before the December 2015 updates, 2048 from the December 2015 updates.
– No Adapter Elements Count (concurrent in-flight I/O over the adapter hit the adapter's limit) § Tuning: Increase the number of FC devices, or – with certain I/O patterns – increase the effective I/O size by increasing the adapter and disk device transfer sizes, to reduce the number of adapter in-flight I/Os (IOPS) without reducing data throughput (and enable coalescing of smaller transfers, if supported by the driver).
NOTE: Adjustments should be made in steps and impact monitored. Adjustments can cause PCI buffer, FC path and endpoint storage system overload, impacting system availability. Start with Vendor recommendations. If the amount of PCI bus address space available for DMA mapping is exceeded (too many extra high bandwidth adapters with all ports enabled on the same PCI bus, such as the 8Gbps FC adapter), the FC adapter driver will log an error, and one or both of the adapter ports will remain in the “Defined” state. Monitor using the errpt command. Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
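A hedged sketch (adapter names and values are examples, not recommendations) for spotting depletion per FC port with fcstat and then raising the tunables as a deferred change:
    # Non-zero counts indicate depletion since the last statistics reset
    for f in $(lsdev -Cc adapter -F name | grep '^fcs'); do
        echo "=== $f ==="
        fcstat "$f" | egrep 'No DMA Resource|No Command Resource|No Adapter Elements'
    done
    # Example deferred change on one port; activates at the next device reconfigure or reboot
    chdev -l fcs0 -a max_xfer_size=0x200000 -a num_cmd_elems=2048 -P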
© Copyright IBM Corporation 2015
104
Updated
VIOS FC adapter transmit statistics Review VIOS physical FC port balancing from VFC clients Single FC adapter in use, and one port dedicated for a single partition
Client LPARs: lpar1, lpar2, lpar3, lpar4, lpar5, lpar6, lpar9 (diagram edited for clarity)

VFC host adapter mapping on vios1:
vfchost    physloc  clntid  clntname  fcname  fcloc                        vfcname  vfcloc   srvslot  clntslot
vfchost0   V1-C389  7       lpar1     fcs0    U2C4E.001.DBJZ258-P2-C5-T1   fcs4     V7-C389  389      389
vfchost1   V1-C381  4       lpar2     fcs0    U2C4E.001.DBJZ258-P2-C5-T1   fcs2     V4-C381  381      381
vfchost2   V1-C393  9       lpar3     fcs0    U2C4E.001.DBJZ258-P2-C5-T1   fcs2     V9-C393  393      393
vfchost3   V1-C391  8       lpar4     fcs0    U2C4E.001.DBJZ258-P2-C5-T1   fcs0     V8-C391  391      391
vfchost4   V1-C383  3       lpar5     fcs0    U2C4E.001.DBJZ258-P2-C5-T1   fcs0     V3-C383  383      383
vfchost5   V1-C385  5       lpar6     fcs0    U2C4E.001.DBJZ258-P2-C5-T1   fcs6     V5-C385  385      385
vfchost6   V1-C387  6       lpar9     fcs1    U2C4E.001.DBJZ258-P2-C5-T2   fcs2     V6-C387  387      387

Transmit frames per VIOS physical FC port:
Partition  Port  Transmit Frames  %
vios1      fcs0  929,469,453      19%
vios1      fcs1  4,030,577,623    81%
vios1      fcs2  246              0%
vios1      fcs3  245              0%
vios2      fcs0  945,372,756      19%
vios2      fcs1  4,094,777,077    81%
vios2      fcs2  249              0%
vios2      fcs3  245              0%
vios3      fcs0  1,817,192,633    51%
vios3      fcs1  1,745,733,775    49%
vios3      fcs2  195              0%
vios3      fcs3  195              0%
vios4      fcs0  1,822,193,742    51%
vios4      fcs1  1,778,531,798    49%
vios4      fcs2  195              0%
vios4      fcs3  195              0%
Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
105
Updated
VIOS FC and VFC adapter transmit statistics dev fcs0
fcs1 fcs2
fcs3
hostname aix024080 aix020011 aix020020 aix020031 aix020040 aix020050 aix020061 aix024080 ... aix024080 aix020011 aix020020 aix020031 aix020040 aix020050 aix020061 aix024080 ...
inreqs outreqs ctrlreqs 5767483964 7422913592 11668 11284797 8739260 132 1977747303 3953972199 141 1130768 110788 133 239466608 174505042 26 2502620280 2445137074 172 3248159 12368928 1060 1742857439 93073464 50598
inbytes 270980860448505 21929282076 97037206004626 84033090846 27649293935924 99787874819858 42863208696 234812505468192
outbytes 128795506984376 8790047232 55027151296024 9518690620 3378143599420 51992581933568 72348515132 8715653267008
15549501663 7343443322 10844401 8739014 5567330759 3716784686 517058 105973 1123662021 152110646 6051609255 2577220519 1555516 123576 6048958690 609891442
534819270194231 8384920440 193779902257810 8293534466 42038560402700 205444813596482 10507149362 599645470310102
124871565447172 8148497408 50438982041112 9588104508 3218826988348 52541974790656 2315244860 34413226821752
1969 118 145 121 27 164 505 50693
Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
DMA_errs Elem_errs Comm_errs 0 38133 0 0 38133 0 0 38133 0 0 38133 0 0 38133 0 0 38133 0 0 38133 0 0 21 0 0 0 0 0 0 0 0 0
72548 72548 72548 72548 72548 72548 72548 27606
0 0 0 0 0 0 0 0
© Copyright IBM Corporation 2015
Edited for clarity
fcstat -client
106
Updated
General considerations for fcs, fscsi & hdisk device attributes § Set FC adapter device driver attributes consistently for active adapter ports – max_xfer_size = 0x200000 = 2 MB (DMA Resource) from default 0x100000=1MB (note) • This will allow larger I/O transfers and will also increase the DMA address space available to the device driver. • NOTE: Please also review Adapter Placement guide, if too much PCI address space is requested for a PCI bus (I/O chip/PHB) the port might not be activated and a message will be reported to the partition error log if so occurs. Note: Tape only adapter ports, can be set to 0x1000000.
– num_cmd_elems = 2048 for partitions with dedicated adapters (note), if supported by the storage vendor. • Number of concurrent I/O requests the FC adapter device driver can queue; if full, I/O service requests will be pending.
§ Set disk device driver attributes consistently: – queue_depth = 16 or 32 (256 is max) • Number of concurrent I/O requests the disk device driver can queue; if full, I/O service requests will be pending. • Also adjust the FC adapter num_cmd_elems to accommodate #disks * queue_depth, to reduce the risk of a few disks with larger queue_depth, when fully utilized, hogging the FC adapter queues – increase queue_depth equally for all disks over the same adapter.
– max_transfer = 0x100000 = 1MB from the default 0x40000 = 256KB • Maximum transfer size of disk device driver I/O requests. • Some disk device drivers allow coalescing of smaller I/O requests into larger ones.
§ Set FC adapter SCSI device driver attributes consistently:
– dyntrk = yes (dynamic tracking of SAN FC port changes, such as moving a cable; 15s limit)
– fc_err_recov = fast_fail (detect path failure faster by limiting the number of retries)
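As a non-authoritative illustration of applying the attribute set above consistently (device names are placeholders; always verify values against the storage vendor and Adapter Placement guidance):
    # FC adapter and FC SCSI protocol device attributes (deferred with -P, activated at reconfigure/reboot)
    chdev -l fcs0   -a max_xfer_size=0x200000 -a num_cmd_elems=2048 -P
    chdev -l fscsi0 -a dyntrk=yes -a fc_err_recov=fast_fail -P
    # Disk device attributes, applied evenly to all disks behind the same adapter
    chdev -l hdisk10 -a queue_depth=32 -a max_transfer=0x100000 -P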
Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
107
Updated
N_Port ID Virtualization (NPIV) considerations num_cmd_elems & max_xfer_size & lg_term_dma § With NPIV over VIOS the physical fibre channel HBA port will be a shared resource – Monitor and tune based on actual workload on VIOS and VIOC, and Storage side/Fabric utilization/load – use the Storage side load to determine settings for the FC adapter ports
§ Consider: – Ordinarily use four (4) up to eight (8) vFC adapters per VIOC with MPIO (AIX PCM round-robin, shortest_queue). – Preferably use two (2) FC adapters per VIOS for availability, and spread VIOC over separate VIOS FC adapter ports. – From VIOS 2.2.4.0 use the new rules facility: rules -o deploy -d; shutdown -restart; rules -o diff -s -d
• This will increase num_cmd_elems for each active FC adapter port to the maximum on the VIOS (max is 4096/3200 per 8/16Gbps adapter port).
– On AIX VIOC after December 2015 and VIOS 2.2.4.0, you can increase VFC num_cmd_elems up to 2048. – On AIX VIOC prior to VIOS 2.2.4.0, use the default (200) or increase to the maximum num_cmd_elems allowed by the device driver (256). Note: APAR IV63231 changes the attribute value range in ODM to match the device driver limit of 256; refer to http://www-01.ibm.com/support/docview.wss?uid=isg1IV63231
– Estimate based on the queue_depth of all simultaneously active disk devices, or on the average service queue length. • num_cmd_elems represents the number of concurrent I/O requests the FC adapter device driver can queue; if depleted, additional concurrent I/O requests will be pending until some current service queue requests have been serviced (a non-zero value in the fcstat statistic "No Command Resource Count" indicates that depletion has occurred). • NOTE: Monitor the storage side so that it is not overloaded or over-utilized; allow up to 50% load utilization per redundant storage side port (to accommodate up to 100% if redundancy is temporarily unavailable).
– Increase max_xfer_size for VIOS FC adapter ports to 0x200000/2MB (DMA Resource) – this will allow larger I/O payload size and will also increase the DMA address space available for the physical adapter device driver – and start with default on VIOC (0x100000/1MB) or the same as VIOS – but not larger than on the VIOS. • NOTE: Review Adapter Placement guide, if too much PCI address space is requested for a PCI bus (I/O chip) the port might not be activated and a message will be reported to the partition error log if so occurs.
– If more than ~2-3000 target devices over each active VIOS FC adapter port – increase lg_term_dma in steps (or start by adding 50% or double up). • NOTE: If too many end point devices, some might remain in Defined state and a message will be reported to the partition error log if so occurs. Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
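A minimal sketch for an AIX NPIV client to check whether the virtual FC adapters actually hit the num_cmd_elems ceiling before raising it (a growing "No Command Resource Count" suggests depletion):
    for f in $(lsdev -Cc adapter -F name | grep '^fcs'); do
        echo "$f num_cmd_elems=$(lsattr -El "$f" -a num_cmd_elems -F value)"
        fcstat "$f" | grep 'No Command Resource Count'
    done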
© Copyright IBM Corporation 2015
108
Updated
FIBRE CHANNEL STATISTICS REPORT: fcs0 Device Type: 8Gb PCI Express Dual Port FC Adapter (df1000f114108a03) (adapter/pciex/df1000f114108a0) World Wide Node Name: 0xC05076056E840030 World Wide Port Name: 0xC05076056E840030
FIBRE CHANNEL STATISTICS REPORT: fcs2 Device Type: 8Gb PCI Express Dual Port FC Adapter (df1000f114108a03) (adapter/pciex/df1000f114108a0) World Wide Node Name: 0xC05076056E840034 World Wide Port Name: 0xC05076056E840034
Class of Service: 3 Port Speed (supported): 8 GBIT Port Speed (running): 8 GBIT Port FC ID: 0x689600 Port Type: Fabric
Class of Service: 3 Port Speed (supported): 8 GBIT Port Speed (running): 8 GBIT Port FC ID: 0x68d600 Port Type: Fabric
Transmit Statistics ------------------Frames: 4019572054 Words: 880768003072
Receive Statistics -----------------2664136691 594370957824
Transmit Statistics ------------------Frames: 3203102410 Words: 24043108864
Receive Statistics -----------------3656170066 372345942272
IP over FC Adapter Driver Information No DMA Resource Count: 0 No Adapter Elements Count: 38133
IP over FC Adapter Driver Information No DMA Resource Count: 0 No Adapter Elements Count: 72548
FC SCSI Adapter Driver Information No DMA Resource Count: 0 No Adapter Elements Count: 38133 No Command Resource Count: 0
FC SCSI Adapter Driver Information No DMA Resource Count: 0 No Adapter Elements Count: 72548 No Command Resource Count: 0
IP over FC Traffic Statistics Input Requests: 0 Output Requests: 0 Control Requests: 0 Input Bytes: 0 Output Bytes: 0
IP over FC Traffic Statistics Input Requests: 0 Output Requests: 0 Control Requests: 0 Input Bytes: 0 Output Bytes: 0
FC SCSI Traffic Statistics Input Requests: 2502620564 Output Requests: 2445137076 Control Requests: 172 Input Bytes: 99787877146386 Output Bytes: 51992581941760
FC SCSI Traffic Statistics Input Requests: 6051609568 Output Requests: 2577220532 Control Requests: 164 Input Bytes: 205444816160578 Output Bytes: 52541974864384
Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
Edited for clarity
VIOS: fcstat -n <WWPN> <FCS#>
109
Updated
Virtual SCSI considerations
§ Virtual Small Computer System Interface (SCSI) devices / Virtual Target Devices (VTD)
– Use the same max transfer size on the VIOS and AIX partitions, and the same max transfer size for all VSCSI disks and VTDs over the same virtual SCSI server adapter.
– Use the same queue depth for all VSCSI disks and VTDs over the same virtual SCSI server adapter.
– It is not recommended to map more than 200 virtual SCSI VTDs per adapter.
– Consider the following for determining how many VSCSI server/client adapter pairs to configure:
• All VTD LUNs with equal max transfer size
• All VTD LUNs from the same backend storage system
• When the sum of all VTD queue depths is higher than the VSCSI adapter can sustain concurrently
• The VSCSI client adapter concurrent service queue limit is ((512-2)/(3+queue_depth)) – see the worked sketch below
  • With the default queue_depth of 3, up to 85 disks with full queues can be concurrently active over the same adapter
  • If more disks than that are needed per VSCSI adapter, then a second set of ((512-2)/(3+queue_depth)) can be added with another adapter pair
• Note: The smallest max_transfer size of all disks mapped over a VSCSI adapter will be applied to all disks mapped over the same VSCSI adapter – change the max_transfer size for a disk before mapping it to the desired VSCSI server-side adapter.
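A small worked sketch of the concurrency limit quoted above, ((512-2)/(3+queue_depth)); the queue_depth value is only an assumption for the example:
    # Disks with full queues that one VSCSI client adapter can drive concurrently
    qd=32                                # assumed disk queue_depth
    print $(( (512 - 2) / (3 + qd) ))    # 510 / 35 -> 14 disks
    # With the default queue_depth of 3: (512-2)/(3+3) = 85 disks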
§ To display the maximum transfer size of a physical device, use the lsdev command: – ODM settings: lsattr -El hdiskN -a max_transfer, DD settings: lsattr -Pl hdiskN -a max_transfer – Or use kdb for actual device driver settings in use: echo scsidisk hdiskN|kdb
§ Set the VSCSI client adapter vscsi_path_to and vscsi_err_recov attributes – exercise careful consideration when setting the virtual SCSI path tunables:
• http://www-01.ibm.com/support/knowledgecenter/POWER7/p7hb1/iphb1_vios_disks.htm
• Consider setting vscsi_path_to to 30s and not the default disabled (virtual SCSI path timeout).
• Consider setting vscsi_err_recov to fast_fail and not the default delayed_fail (virtual SCSI path failure).
• Consider setting rw_timeout to 120 and not the default disabled (virtual SCSI read/write timeout).
Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
110
Updated
Design pattern considerations for VSCSI priority setting § When the algorithm attribute value is failover, the paths are kept in a list. The sequence in this list determines which path is selected first and is determined by the value of the path priority attribute. A priority of 1 is the highest priority. Multiple paths can have the same priority value, but if all paths have the same value, selection is based on when each path was configured. – http://www-01.ibm.com/support/knowledgecenter/ssw_aix_71/com.ibm.aix.osdevice/devpathctrlmodatts.htm
§ Pattern – Even lpar id = highest priority for first path to even VIOS (second) – Odd lpar id = highest priority for first path to odd VIOS (first)
§ Assumptions – A dual VIOS cluster is used for VSCSI – if, and only if, each AIX partition has been configured in the same order to the dual VIOS cluster nodes: vscsi0 on all AIX partitions connects to the same VIOS, and vscsi1 on all AIX partitions connects to the other VIOS
§ Action
– If even lpar id, start with even priority (2) for the first path; if odd, start with odd priority (1) for the first path; reverse the priority for the second path (use uname -L to display the lpar id if scripting – see the scripting sketch after the command examples below)
– Odd lpar id, such as 1,3,5,7,9,11,13...:
• chpath -l hdisk0 -p vscsi0 -a priority=1
• chpath -l hdisk0 -p vscsi1 -a priority=2
• chpath -l hdisk1 -p vscsi0 -a priority=2
• chpath -l hdisk1 -p vscsi1 -a priority=1
– Even lpar id, such as 2,4,6,8,10,12...:
• chpath -l hdisk0 -p vscsi0 -a priority=2
• chpath -l hdisk0 -p vscsi1 -a priority=1
• chpath -l hdisk1 -p vscsi0 -a priority=1
• chpath -l hdisk1 -p vscsi1 -a priority=2
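A hedged scripting sketch of the pattern above, deriving the starting priority from the LPAR id parity (uname -L); the hdisk and vscsi names are assumptions:
    # Odd LPAR id -> priority 1 on vscsi0 for hdisk0; even LPAR id -> priority 2 on vscsi0
    lparid=$(uname -L | awk '{print $1}')
    if (( lparid % 2 )); then p0=1 p1=2; else p0=2 p1=1; fi
    chpath -l hdisk0 -p vscsi0 -a priority=$p0
    chpath -l hdisk0 -p vscsi1 -a priority=$p1
    chpath -l hdisk1 -p vscsi0 -a priority=$p1
    chpath -l hdisk1 -p vscsi1 -a priority=$p0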
Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
NOTE: When the algorithm attribute value is round_robin, the sequence is determined by percent of I/O. The path priority value determines the percentage of the I/O that must be processed down each path. I/O is distributed across the enabled paths. A path is selected until it meets its required percentage. © Copyright IBM Corporation 2015
111
Updated
Do not forget to check your PATHS
§ You might discover a discrepancy between the number of defined and enabled paths
– Failed paths shown by lspath (lsmpio)
– Offline (with Error) shown by dlnkmgr view in the example below
§ Investigate the cause for fewer enabled than defined paths
– Check zoning
– Check LUN masking
– Open a PMR to analyze errpt Sense Data for failing paths

Number of paths:                Number of Enabled paths:
aixlpar1:fscsi0:153             aixlpar1:fscsi0:153
aixlpar1:fscsi1:153             aixlpar1:fscsi1:153
aixlpar1:fscsi2:153             aixlpar1:fscsi2:153
aixlpar1:fscsi3:153             aixlpar1:fscsi3:74
aixlpar1:fscsi4:153             aixlpar1:fscsi4:153
aixlpar1:fscsi5:153             aixlpar1:fscsi5:153
aixlpar1:fscsi6:153             aixlpar1:fscsi6:153
aixlpar1:fscsi7:153             aixlpar1:fscsi7:153

# dlnkmgr view -cha
ChaID  Product  ChaPort  IO-Count    IO-Errors  Paths  OnlinePaths
00001  USP_V    1E       831777845   1295       149    149
00002  USP_V    2D       877731983   0          149    149
00003  USP_V    4E       866713766   0          149    149
00004  USP_V    1D       863540109   412        149    149
00005  USP_V    2E       745925604   6079       149    73
00006  USP_V    2H       867653448   0          149    149
00011  USP_V    2G       848129310   0          149    149
00014  USP_V    3G       869904854   0          149    149
# echo vfcs | kdb
NAME  ADDRESS             STATE   HOST   HOST_ADAP  OPENED  NUM_ACTIVE
fcs0  0xF1000A0034078000  0x0008  VIOS1  vfchost77  0x01    0x0002
fcs1  0xF1000A003407A000  0x0008  VIOS1  vfchost74  0x01    0x0005
fcs2  0xF1000A000015E000  0x0008  VIOS1  vfchost78  0x01    0x0001
fcs3  0xF1000A000015C000  0x0008  VIOS1  vfchost75  0x01    0x0000
fcs4  0xF1000A0000150000  0x0008  VIOS2  vfchost91  0x01    0x0000
fcs5  0xF1000A0034074000  0x0008  VIOS2  vfchost89  0x01    0x0000
fcs6  0xF1000A0034076000  0x0008  VIOS2  vfchost92  0x01    0x0002
fcs7  0xF1000A0034072000  0x0008  VIOS2  vfchost90  0x01    0x0002

# lsmap -npiv -vadapter vfchost75
Name        Physloc                     ClntID  ClntName  ClntOS
----------  --------------------------  ------  --------  ------
vfchost75   U9119.FHB.1237816-V3-C324   32      aixlpar1  AIX
Status: LOGGED_IN   FC name: fcs50   FC loc code: U5873.001.9SS007K-P1-C5-T1
Ports logged in: 3   Flags: a<LOGGED_IN,STRIP_MERGE>
VFC client name: fcs3   VFC client DRC: U9119.FHB.1237816-V32-C324
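A quick, hedged one-liner (assuming the default lspath output columns: status, device, parent) to count defined versus Enabled paths per parent adapter on the AIX client:
    lspath | awk '{tot[$3]++; if ($1 == "Enabled") ok[$3]++}
                  END {for (p in tot) printf "%-10s defined=%d enabled=%d\n", p, tot[p], ok[p]+0}'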
http://www.ibm.com/developerworks/aix/library/au-aix-mpio/ Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
112
Updated
AIX tunables
© Copyright IBM Corporation 2015
Updated
AIX system tunables § AIX 6.1 and 7.1 default settings are best practice and recommended for most workloads, was designed and performance tested for Server workload – New tunable options can be released and first documented in Service Packs – Focus: multi-user, application and database server services, such as changed from AIX 5.3 • • • • • • •
minperm% = 3 maxperm% = 90 maxclient% = 90 lru_file_repage = 0 (reboot) strict_maxclient = 1 strict_maxperm = 0 page_steal_method = 1 (reboot)
§ AIX 5.3 (and pre) default settings were designed for primary Workstation workload – Tuning changes needed to adjust for multi-user, application and database server services – Focus: single-user and applications with large files
§ Do not change Restricted Tunables unless requested by IBM Support and Development – Open a service request with IBM Support, refer to Problem Management Record (PMR)
§ Understand what tunables impact and if some tunables override change others – Such as when ISNO is enabled, and device attribute TCP tunables are in place, and how this impact the global settings changeable with the no command
§ For specific workloads, additional tuning can be required – Leverage AIX Runtime Expert to check and set tunables – Baseline and document how changing tunables for your workload have repeatable positive impact for: • Response time, Throughput, Maximum user load, Business related metrics, …
http://www-01.ibm.com/support/knowledgecenter/ssw_aix_71/com.ibm.aix.performance/vmm_page_replace_tuning.htm Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
114
Updated
Checking for changed AIX system tunable attributes for no vmo ioo schedo raso nfso lvmo § Simple sample script to check if the current value is different from the default value. – The -F option displays Restricted Tunables – NOTE: Restricted Tunables are only to be changed upon request by IBM development or development support. – The sample script below will not recalculate displayed values (such as 64K to 65536), and some tunables are adjusted by the kernel from the default setting (such as net_malloc_police). – Sample script to compare each attribute's current value with its default value: for O in no vmo ioo schedo raso nfso lvmo; do echo "CHECKING $O"; { $O -FL 2>/dev/null || $O -L; } | awk '/^[a-z]/{if($2!=$3)print $0}'; done
Output example (columns: current, default, reboot, min, max, unit, type):
sb_max          4M      1M      4M      4K    8E-1   byte     D
sack            1       0       0       0     1      boolean  C
tcp_fastlo      1       0       0       0     1      boolean  C
udp_recvspace   1320K   42080   1320K   4K    8E-1   byte     C
udp_sendspace   132K    9K      132K    4K    8E-1   byte     C
ipqmaxlen       512     100     512     100   2G-1   numeric  R
Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
115
Updated
Adapter Placement
© Copyright IBM Corporation 2015
Updated
Brief comparison: POWER7 795 with 5803 I/O drawer and POWER8 E880 CEC
POWER7 795 with 5803 I/O drawer – slot priority (one loop):
P1-C1:1  P1-C2:5  P1-C3:9  P1-C4:3  P1-C5:7  P1-C6:11  P1-C7:13  P1-C8:15  P1-C9:17  P1-C10:19
P2-C1:2  P2-C2:6  P2-C3:10 P2-C4:4  P2-C5:8  P2-C6:12  P2-C7:14  P2-C8:16  P2-C9:18  P2-C10:20

POWER8 E880 CEC:
Put the highest bandwidth adapters in the CEC slots; for example the 40 Gb Ethernet adapter (EC3A/EC3B) should only be installed in the internal CEC slots. The PCIe Expansion Drawer Cable Cards also go in the CEC slots.
• CEC low-profile cards, drawer full-height cards
• CEC PCIe3 slot details: https://www.ibm.com/support/knowledgecenter/en/9119-MME/p8eab/p8eab_87x_88x_slot_details.htm
• EMX0 PCIe3 slot details: https://www.ibm.com/support/knowledgecenter/en/9119-MME/p8eab/p8eab_emx0_slot_details.htm
• E880 CEC slot priority table (P1-C1 through P1-C8) not reproduced here; see the slot details links above.

Many rules apply for optimum performance; consider limiting the total number of high bandwidth and extra-high bandwidth (EHB) adapters, using the following guidelines:
• No more than three Gb Ethernet ports per I/O chip.
• No more than three high bandwidth adapters per I/O chip.
• No more than one extra-high bandwidth adapter per I/O chip. If both ports are concurrently used then each port counts as one adapter: 5708 10 Gb FCoE PCIe Dual Port Adapter; 5735 8 Gigabit PCI Express Dual Port Fibre Channel Adapter.
• No more than one 10 Gb Ethernet port per two processors in a system. If one 10 Gb Ethernet port is present per two processors, no other 10 Gb or 1 Gb ports are allowed for optimum performance.
• No more than two 1 Gb Ethernet ports per one processor in a system. More Ethernet adapters can be added for connectivity.
• Place the highest performance adapters in slots P1-C1 through P1-C6 and P2-C1 through P2-C6.
Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
117
Updated
NOTE
NOTE: If the amount of PCI bus address space available for DMA mapping is exceeded (too many extra high bandwidth adapters with all ports enabled on the same PCI bus, such as the 8Gbps FC adapter), the FC adapter driver will log an error, and one or both of the adapter ports will remain in the “Defined” state. Monitor using the errpt command. Overloading can also cause issues with interrupts, and at the endpoint storage side lead to increase in service times. Adjustments to FC device driver tunable attributes should be made in steps and impact monitored. Adjustments can cause PCI buffer, FC path and endpoint storage system overload, impacting system availability. Align with Vendor recommendations.
Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
118
Updated
Firmware and System Software
© Copyright IBM Corporation 2015
Updated
Patch and fix maintenance § Keep the system firmware current – For Power 795 >>> Upgrade from 730 level – Significant enhancements from 760 and 780 firmware – Check out HMC V8R8 performance monitoring and reporting capability (Server>Performance)
§ Keep AIX and IOS current – check for performance fixes – Plan ahead with application vendor certification
§ Establish and verify system software and firmware/microcode update strategy – Review “Service and support best practices” for Power Systems • http://www14.software.ibm.com/webapp/set2/sas/f/best/home.html
– Maintain a system software and firmware/microcode correlation matrix • http://download.boulder.ibm.com/ibmdl/pub/software/server/firmware/AH-Firmware-Hist.html
– Regularly evaluate cross-product compatibility information and latest fix recommendations (FLRT) • https://www14.software.ibm.com/webapp/set2/flrt/home
– Regularly evaluate latest microcode recommendations with Microcode Discovery Services (MDS) • http://www14.software.ibm.com/webapp/set2/mds/
– Periodically review product support lifecycles • http://www-01.ibm.com/software/support/lifecycle/index.html
– Sign up to receive IBM bulletins for security advisories, high impact issues, APARs, Techdocs, etc • http://www14.software.ibm.com/webapp/set2/subscriptions/pqvcmjd • https://www-947.ibm.com/systems/support/myview/subscription/css.wss/folders?methodName=listMyFolders • https://www-947.ibm.com/systems/support/myview/subscription/css.wss/subscriptions#help-2
– Subscribe to APAR updates, available for specific ones and related to components, such as AIX 7.1 Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
120
Updated
Documentation
© Copyright IBM Corporation 2015
Updated
Best Practices documents and References:
POWER: • Power Virtualization Best Practices • Active Memory Expansion Performance
IBM i: • Performance Management on IBM i • IBM i on Power – Performance FAQ • Under the Hood: Power Logical Partitions
AIX and VIOS: • AIX on Power – Performance FAQ • VIOS Sizing • IBM Power Systems Performance Report (Enhanced rPerf)
Java: • Java Performance on Power
Advisor Tools: • Workload Estimator • PowerVM Virtualization Performance LPAR Advisor • VIOS Advisor • Java Performance Advisor
Redbooks: • PowerVM Best Practices • PowerVM Managing and Monitoring • PowerVM Virtualization Introduction and Configuration • POWER Optimization and Tuning Guide
AIX and VIOS: • IBM i Technology Updates • Fix Central (for Firmware, AIX and VIOS updates)
Databases: • AIX and Oracle Database Performance Considerations (ICC)
Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
122
Updated
Thank you – Tack !
J Björn Rodén roden@ae.ibm.com http://www.linkedin.com/in/roden Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
123
Updated
Continue growing your IBM skills
ibm.com/training provides a comprehensive portfolio of skills and career accelerators that are designed to meet all your training needs.
§ Training in cities local to you - where and when you need it, and in the format you want
– Use IBM Training Search to locate public training classes near to you with our five Global Training Providers – Private training is also available with our Global Training Providers
§ Demanding a high standard of quality – view the paths to success
– Browse Training Paths and Certifications to find the course that is right for you
§ If you can’t find the training that is right for you with our Global Training Providers, we can help. – Contact IBM Training at dpmc@us.ibm.com
Global Skills Initiative
Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada © Copyright IBM Corporation 2015
124
Updated
Please fill out an evaluation!
@ IBMtechU
Some great prizes to be won!
Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
125
Updated
Continue growing your IBM skills
Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
126
Updated
Extras
© Copyright IBM Corporation 2015
Updated
Technical question on DNS configuration §
When a process receives a symbolic name and needs to resolve it into an address, it calls a resolver subroutine. The method used by the set of resolver subroutines to resolve names depends on the local host configuration.
§
The Domain Name Protocol allows a host in a domain to act as a name server for other hosts within the domain. –
–
§
If the /etc/resolv.conf file exists, the local resolver routines either use a local name resolution database maintained by a local named daemon (a process) to resolve Internet names and addresses, or they use the Domain Name Protocol to request name resolution services from a remote DOMAIN name server host (unless the order is changed by configuration of irs.conf, netsvc.conf or the NSORDER environment variable). If no resolv.conf file exists, the resolver routines continue with their default resolution path. The resolv.conf file can contain one domain entry or one search entry, a maximum of three nameserver entries, and any number of options entries. • timeout:n • Enables you to specify the initial timeout for a query to a nameserver. The default value is five seconds. The maximum value is 30 seconds. For the second and successive rounds of queries, the resolver doubles the initial timeout and divides it by the number of nameservers in the resolv.conf file. • attempts:n • Enables you to specify how many queries the resolver should send to each nameserver in the resolv.conf file before it stops execution. The default value is 2. The maximum value is 5. • rotate • Enables the resolver to use all the nameservers in the resolv.conf file, not just the first one.
Environment variables for process controlled domain name resolution lookup: –
–
RES_TIMEOUT • Overrides the default value of the retrans field of the _res structure, which is the value of the RES_TIMEOUT constant defined in the /usr/include/resolv.h file. This value is the base time-out period in seconds between queries to the name servers. After each failed attempt, the time-out period is doubled. The time-out period is divided by the number of name servers defined. The minimum time-out period is 1 second. RES_RETRY • Overrides the default value for the retry field of the _res structure, which is 4. This value is the number of times the resolver tries to query the name servers before giving up. Setting RES_RETRY to 0 prevents the resolver from querying the name servers.
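As an illustrative example only (domain and name server addresses are placeholders), a /etc/resolv.conf combining the options described above:
    # /etc/resolv.conf -- example values, adjust to your environment
    search example.com
    nameserver 10.0.0.53
    nameserver 10.0.1.53
    options timeout:2
    options attempts:2
    options rotate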
http://publib.boulder.ibm.com/infocenter/aix/v7r1/topic/com.ibm.aix.commadmn/doc/commadmndita/tcpip_nameresol.htm http://publib.boulder.ibm.com/infocenter/aix/v7r1/topic/com.ibm.aix.files/doc/aixfiles/resolv.conf.htm http://publib.boulder.ibm.com/infocenter/aix/v7r1/topic/com.ibm.aix.progcomm/doc/progcomc/skt_dns.htm 128 Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
128
Updated
Technical question on netcd DNS configuration § The netcd daemon reduces the time taken by the local (/etc/hosts), DNS, NIS, NIS+ and user loadable module services to respond to a query by caching the response retrieved from resolvers. – When the netcd daemon is running and configured for a resolver (for example DNS) and a map (for example hosts), the resolution is first made using the cached answers. If it fails, the resolver is called and the response is cached by the netcd daemon. • • • • • • •
Start: startsrc -s netcd Stop: stopsrc -s netcd Check daemon: lssrc -l -s netcd Check cache content: netcdctrl -t hosts -e dns -a netcd.cache Check cache statistics: netcdctrl -t hosts -e dns -s netcd.stat Flush cache (all|dns|local): netcdctrl -t hosts -e dns -f Configuration file: /etc/netcd.conf • Format: cache <type_of_cache> <type_of_map> <hash_size> <cache_ttl> • Example: cache dns hosts • Default: cache all all 128 60
§ The netcd daemon is a newer alternative to using a caching-only DNS server configuration – which is only updated by transfer from the DNS master to the caching-only DNS server after a change.
http://publib.boulder.ibm.com/infocenter/aix/v7r1/topic/com.ibm.aix.cmds/doc/aixcmds4/netcd.htm Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
129
Updated
Technical question on NFS mount points in root file system root directory §
Why it is not advisable to mount NFS filesystems in the root filesystem's root directory – When a program traverses the root file system and scans the root directory (readdir), the attributes (user permissions) for each NFS mount point will be requested from NFS, and if they are not locally cached the request goes to the NFS server for that file system (getattr). – In the worst case this can give the symptom of a "hang", due to lengthy NFS timeouts if the NFS server is unavailable, especially if the mount is hard foreground; but even if it is soft background it ordinarily leads to intermittent slowdown of the root directory scan (when the local attribute cache is invalid). – Regardless, it causes unnecessary network traffic and a SPOF exposure for a production server system.
§
How to do – Only use NFS mount points at sublevel from root file system root directory, such as: • /nfs/<mountpoint> – If a directory is required by application in the root directory, create a symbolic link from root directory to the /nfs/<mountpoint> with the ln -s command, such as: • ln -s /nfs/mountpoint /mountpoint
Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
130
Updated
Technical question on PVID and IEEE VLAN tagging § There are two ways to achieve VLAN tagging in PowerVM A. Explicit VLAN tagging via VLAN pseudo device B. Implicit VLAN tagging via pHyp – In PowerVM, when a VIO client sends network traffic through the virtual Ethernet adapter interface, pHyp may insert VLAN tag before delivering the traffic to SEA for bridging. – Determination to insert VLAN tag is done based on the PVID of the virtual Ethernet adapter where the network traffic is originated. – If the PVID of the virtual Ethernet adapter matches the PVID of any trunk adapter on that virtual switch, then the VLAN tag is NOT inserted. Otherwise, the VLAN tag is inserted. In this case, the VLAN tag ID is the PVID of the virtual Ethernet adapter. § Examples: – PowerVM environment has one VIOS partition and two client LPAR partitions, A & B. VIOS partition has SEA configured with PVID 100 and additional VLAN 200 – VIO client LPAR A has virtual Ethernet adapter with PVID 100 – VIO client LPAR B has virtual Ethernet adapter with PVID 200 – In this case, any network traffic originating from VIO client A would be untagged – Similarly, any network traffic originating from VIO client B would be tagged with VLAN tag ID of 200. In this case, pHyp will insert VLAN tag before it delivers traffic to SEA for bridging. Correspondingly, pHyp will remove VLAN tag before delivering traffic to VIO client
Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
131
Updated
Technical question on sizing VIOS § Sizing the Virtual I/O Server for the Shared Ethernet Adapter involves the following factors – Defining the target bandwidth (MB per second), or transaction rate requirements (operations per second). The target performance of the configuration must be determined from your workload requirements. – Defining the type of workload (streaming or transaction oriented). – Identifying the maximum transmission unit (MTU) size that will be used (1500 or jumbo frames). – Determining if the Shared Ethernet Adapter will run in a threaded or nonthreaded environment. – Knowing the throughput rates that various Ethernet adapters can provide (see Adapter selection). – Knowing the processor cycles required per byte of throughput or per transaction (see Processor allocation). § References: – http://public.dhe.ibm.com/common/ssi/ecm/en/poo03017usen/POO03017USEN.PDF – http://pic.dhe.ibm.com/infocenter/powersys/v3r1m5/index.jsp?topic=/p7hb1/iphb1_vios_planning_cap.htm
Description / Limit:
– Maximum virtual Ethernet adapters per LPAR: 256
– Maximum number of VLANs per virtual adapter: 21 VLANs (20 VID, 1 PVID)
– Number of virtual adapters per single SEA sharing a single physical network adapter: 16
– Maximum number of VLAN IDs: 4094
– Maximum virtual Ethernet frame size: 65,408 bytes
– Maximum number of physical adapters in a link aggregation: 8 primary, 1 backup
Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
132
Updated
Technical question on Shared Ethernet Adapter hidden ctrl channel
[Diagram, edited for clarity: two client LPARs with an IP address on VLAN x via ent0, bridged by a dual-VIOS SEA failover pair (VIOS#1 and VIOS#2). On each VIOS: physical adapter ent0 (may be an aggregation of adapters), trunk virtual adapter ent2 on VLAN x, SEA ent4, and ent1 – a separate virtual adapter for the VIOS admin IP configuration, kept isolated from the SEA configuration. The control channel between the SEAs is hidden on Control Channel VLAN 4095; no dedicated control channel adapter is configured.]
mkvdev –sea ent0 –vadapter ent2 –default ent2 –defaultid x –attr ha_mode=auto
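A short verification sketch (the SEA name ent4 follows the mkvdev example above) to confirm on each VIOS that the SEA runs in auto failover mode and that no explicit control channel adapter is set when the hidden VLAN 4095 channel is in use:
    # On each VIOS, in the padmin shell
    lsdev -dev ent4 -attr ha_mode      # expected: auto
    lsdev -dev ent4 -attr ctl_chan     # expected: empty when the hidden control channel is in use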
Ethernet Switch VLAN x http://www-01.ibm.com/support/docview.wss?uid=isg1IV37193 Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
133
Updated
Technical question on Tracing Virtual Processor Management (VPM) § Tracing Virtual Processor Management (VPM) by the scheduler (swapper, PID 0) – 63C • trace -aj63C; sleep 10; trcstop; trcrpt -o trace63C.out
001
0.000000000
... 63C 3.100511638 63C 3.102152562 63C 3.105167898 63C 3.105549792 63C 3.107791880 63C 3.210496353 new_cpu=0010 63C 3.648087578 ...
0.000000
TRACE ON channel 0 Wed Oct 10 15:38:03 2012
0.000185 1.640924 3.015336 0.381894 2.242088 102.704473
VPM VPM VPM VPM VPM VPM
437.591225
VPM sleep: cpu=001A
Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
vpm_fold_cpu: cpu=0017, gpp=0014 sleep: cpu=0016 sleep: cpu=0014 sleep: cpu=0017 sleep: cpu=0015 sched timer: srad=0000, old_cpu=000C,
© Copyright IBM Corporation 2015
134
Updated
Technical question on Tracing the Hypervisor § Tracing hypervisor dispatch – 419 • trace -aj419; sleep 10; trcstop; trcrpt -o trace419.out 001
0.000000000
0.000000
419
0.022497994
22.497994
419
1.400330103
1377.832109
TRACE ON channel 0 Wed Oct 10 15:38:58 2012 Virtual CPU preemption/dispatch data Preempt: Timeout, Dispatch: Timeslice vProcIndex=0021 rtrdelta=0.000 us enqdelta=7.333 us exdelta=7.812 us start wait=0.000000 ms end wait=0.000000 ms SRR0=0000000000000500 SRR1=8000000000001000 dist: local srad=0 assoc=1 Virtual CPU preemption/dispatch data Preempt: Timeout, Dispatch: Timeslice vProcIndex=0024 rtrdelta=0.000 us enqdelta=6.535 us exdelta=8.044 us start wait=1399.898056 ms end wait=1399.912640 ms SRR0=000000000000D2AC SRR1=800000000000F032 dist: local srad=0 assoc=1
...
rtrdelta - time between when thread blocked and event made them ready to run (ex. waiting on disk op) enqdelta - time between ready to run and when thread had entitlement to run exdelta - time between waiting for entitlement and when hypervisor found an idle physical processor to dispatch SRR0 - Next Instruction Address where OS was executing when cede/preempt SRR1 - Portions of machine state register where OS was executing when cede/preempt
Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
135
Updated
POWER7 Power Mode Setup: Energy Scale Functions – BOOST !
IBM EnergyScale for POWER7 Processor-Based Systems, March 2013 http://public.dhe.ibm.com/common/ssi/ecm/en/pow03039usen/POW03039USEN.PDF EnergyScale Performance Characteristics for IBM POWER7 Systems, April 2010 http://public.dhe.ibm.com/common/ssi/ecm/en/pow03042usen/POW03042USEN.PDF Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
136
Updated
Dedicated processor mode partition and sharing_mode § Frame/CEC shared with multiple active partitions, not single use Frame/CEC § Dedicated processor partitions spreads tasks across cores to improve individual task’s response time – Higher throughput and lower latency for the partition, however can reduce the servers throughput total since some resources are reserved and are not shared. – However, AIX will try to optimize for higher total server throughput by folding
§ If all dedicated and not shared processors – If no other partition can make use the processing cycles ceded to the Hypervisor, and do not need active processor sharing (donation), then can disable cede – Set sharing_mode to keep_idle_procs – Or in AIX • schedo -p -o ded_cpu_donate_thresh=0 • schedo -p -o smt_snooze_delay=-1
§ If all shared or mixed, for dedicated: – Allow active partition processor sharing (donation) – Donating and ceding of unused processing capacity can benefit total server workload – Set sharing_mode to share_idle_procs_always
Allow when partition is active Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
137
Updated
SMT behavior when disabling cede to maximize spread § Dedicated partition SMT4 with default settings – share_idle_procs
§ Whether AIX stops scheduling on a virtual processor's hardware threads, aka folding, is decided by the scheduler (swapper, PID 0) every second. § The schedo command can be used to enable or disable folding – vpm_fold_policy
The same workload test and period
§ Same dedicated partition with default settings and – share_idle_procs – ded_cpu_donate_thresh=0 – smt_snooze_delay=-1
Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
138
Updated
vpm_throughput customer test
© Copyright IBM Corporation 2015
Updated
Delay processor unfolding with schedo vpm_throughput_mode § Leverage scaled throughput mode for delayed processor unfolding § Values: – Default: 0 – Range: 0 – 4 (8 with POWER8) – Type: Dynamic
§ Tuning: – The throughput mode determines the desired level of SMT exploitation on each virtual processor core before unfolding another core. – A higher value will result in fewer cores being unfolded for a given workload. – This increases scaled throughput at the expense of raw throughput. – A value of zero disables this option in which case the default (raw throughput) mode will apply, which: • Spread to primary core thread first, before using secondary and tertiary threads. • Pack software threads onto fewer virtual processors and increase the runtime length of threads on fewer virtual processors, by cede or confer of remaining entitled processing capacity.
– vpm_throughput_mode=0 (default raw throughput mode)
– vpm_throughput_mode=1 (optimized VP folding)
– vpm_throughput_mode=2 (fill two LP on VP before unfolding additional VP)
– vpm_throughput_mode=4 (fill four LP on VP before unfolding additional VP)
– POWER8 ONLY: vpm_throughput_mode=8 (fill eight LP on VP before unfolding additional VP)
§ NOTE: – The schedo vpm_throughput_core_threshold tunable can be set to specify the number of VPs that must be unfolded before the vpm_throughput_mode tunable will come into use. With vpm_throughput_mode set to 4 and VP>EC, set vpm_throughput_core_threshold to the EC processing units rounded. Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
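An illustrative sequence (values are examples, not recommendations) for checking and changing the tunables dynamically with schedo, plus the related threshold:
    # Display the current settings
    schedo -o vpm_throughput_mode
    schedo -o vpm_throughput_core_threshold
    # Fill two logical processors per virtual processor before unfolding another VP (dynamic change)
    schedo -o vpm_throughput_mode=2
    # Make the change persistent, and only apply scaled mode once 8 VPs are already unfolded
    schedo -p -o vpm_throughput_mode=2
    schedo -p -o vpm_throughput_core_threshold=8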
© Copyright IBM Corporation 2015
140
Updated
Example using vpm_throughput_mode on POWER7 (1/4) § Baseline with vpm_throughput_mode=0 (default raw throughput mode)
REF1  SRAD  MEM       CPU
0     0     36829.00  0-3 8-11 16-19 24-27 32-35 40-43 48-51 56-59 68-71 80-83 92-95 104-107 116-119 128-131 140-143 152-155
      1     36817.94  4-7 12-15 20-23 28-31 36-39 44-47 52-55 60-63 72-75 84-87 96-99 108-111 120-123 132-135 144-147 156-159
      2     19402.19  64-67 76-79 88-91 100-103 112-115 124-127 136-139 148-151
Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
VP_CPU     Avg    Max    Max:Avg
User%      7.9    12.5   1.6
Sys%       2.5    4.8    2.0
Wait%      0.0    0.0    37.5
Idle%      8.7    12.8   1.5
CPU%       19.1   28.1   1.5
PhysCPU    7.6    11.3   1.5
[Per-logical-CPU chart (CPU065, CPU077, CPU089, CPU101) edited out for clarity]
© Copyright IBM Corporation 2015
142
Updated
Example using vpm_throughput_mode on POWER7 (2/4) § vpm_throughput_mode=1 (optimized VP folding)
REF1  SRAD  MEM       CPU
0     0     36829.00  0-3 8-11 16-19 24-27 32-35 40-43 48-51 56-59 68-71 80-83 92-95 104-107 116-119 128-131 140-143 152-155
      1     36817.94  4-7 12-15 20-23 28-31 36-39 44-47 52-55 60-63 72-75 84-87 96-99 108-111 120-123 132-135 144-147 156-159
      2     19402.19  64-67 76-79 88-91 100-103 112-115 124-127 136-139 148-151
Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
VP_CPU     Avg    Max    Max:Avg
User%      7.6    12.5   1.6
Sys%       2.3    4.2    1.8
Wait%      0.0    0.0    300.0
Idle%      8.4    12.1   1.4
CPU%       18.2   26.6   1.5
PhysCPU    7.3    10.6   1.5
[Per-logical-CPU chart (CPU065, CPU077, CPU089, CPU101) edited out for clarity]
© Copyright IBM Corporation 2015
143
Updated
Example using vpm_throughput_mode on POWER7 (3/4) § vpm_throughput_mode=2 (fill two LP on VP before unfolding additional VP)
REF1  SRAD  MEM       CPU
0     0     36829.00  0-3 8-11 16-19 24-27 32-35 40-43 48-51 56-59 68-71 80-83 92-95 104-107 116-119 128-131 140-143 152-155
      1     36817.94  4-7 12-15 20-23 28-31 36-39 44-47 52-55 60-63 72-75 84-87 96-99 108-111 120-123 132-135 144-147 156-159
      2     19402.19  64-67 76-79 88-91 100-103 112-115 124-127 136-139 148-151
Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
VP_CPU     Avg    Max    Max:Avg
User%      9.2    12.7   1.4
Sys%       2.6    3.4    1.3
Wait%      0.0    0.0    150.0
Idle%      7.6    10.1   1.3
CPU%       19.4   24.0   1.2
PhysCPU    7.8    9.6    1.2
[Per-logical-CPU chart (CPU065, CPU077, CPU089) edited out for clarity]
© Copyright IBM Corporation 2015
144
Updated
Example using vpm_throughput_mode on POWER7 (4/4) § vpm_throughput_mode=4 (fill four LP on VP before unfolding additional VP)
REF1  SRAD  MEM       CPU
0     0     36829.00  0-3 8-11 16-19 24-27 32-35 40-43 48-51 56-59 68-71 80-83 92-95 104-107 116-119 128-131 140-143 152-155
      1     36817.94  4-7 12-15 20-23 28-31 36-39 44-47 52-55 60-63 72-75 84-87 96-99 108-111 120-123 132-135 144-147 156-159
      2     19402.19  64-67 76-79 88-91 100-103 112-115 124-127 136-139 148-151
VP_CPU     Avg    Max    Max:Avg
User%      6.4    9.0    1.4
Sys%       1.7    2.6    1.5
Wait%      0.0    0.0    60.0
Idle%      5.9    9.9    1.7
CPU%       14.0   20.2   1.4
PhysCPU    5.6    8.1    1.4
[Per-logical-CPU chart (CPU065) edited out for clarity]
Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
145
Updated
Adapter Placement
© Copyright IBM Corporation 2015
Updated
5803 I/O drawer Adapter Placement Guide § Actual adapter placement information can be collected from: • sysplan • mksysplan -f FILENAME.sysplan -m MANAGEDSYSTEM • NOTE: lssyscfg -r sys -F name
• lshwres • lshwres –m MANAGEDSYSTEM -r io --rsubtype slot -F drc_name drc_index description lpar_name
• pcat • hmc/…/lshwres-r_io-rsubtype_slot.txt
• snap • general.snap • NOTE: lsslot -c pci
PLAN adapter placement, MONITOR utilization and VERIFY layout For more information on IBM Power 795 (9119-FHA) Adapter Placement: http://pic.dhe.ibm.com/infocenter/powersys/v3r1m5/topic/areab/areabfc5803.htm http://publib.boulder.ibm.com/infocenter/powersys/v3r1m5/index.jsp?topic=/areab/areabkickoff.htm http://pic.dhe.ibm.com/infocenter/powersys/v3r1m5/topic/areab/areab.pdf Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
147
Updated
5803 I/O drawer
For optimum performance, consider limiting the total number of high bandwidth and extra-high bandwidth (EHB) adapters, using the following guidelines:
• No more than three Gb Ethernet ports per I/O chip.
• No more than three high bandwidth adapters per I/O chip.
• No more than one extra-high bandwidth adapter per I/O chip. If both ports are concurrently used then each port counts as one adapter: 5708 10 Gb FCoE PCIe Dual Port Adapter; 5735 8 Gigabit PCI Express Dual Port Fibre Channel Adapter.
• No more than one 10 Gb Ethernet port per two processors in a system. If one 10 Gb Ethernet port is present per two processors, no other 10 Gb or 1 Gb ports are allowed for optimum performance.
• No more than two 1 Gb Ethernet ports per one processor in a system. More Ethernet adapters can be added for connectivity.
• Place the highest performance adapters in slots P1-C1 through P1-C6 and P2-C1 through P2-C6.

Location Code   PCI Host Bridge   Slot priority (one loop)   Slot priority (separate loops)
P1-C1           PHB1              1                          1
P1-C2           PHB2              5                          3
P1-C3           PHB3              9                          5
P1-C4           PHB4              3                          2
P1-C5           PHB5              7                          4
P1-C6           PHB6              11                         6
P1-C7           PHB7              13                         7
P1-C8           PHB8              15                         8
P1-C9           PHB9              17                         9
P1-C10          PHB10             19                         10
P2-C1           PHB11             2                          1
P2-C2           PHB12             6                          3
P2-C3           PHB13             10                         5
P2-C4           PHB14             4                          2
P2-C5           PHB15             8                          4
P2-C6           PHB16             12                         6
P2-C7           PHB17             14                         7
P2-C8           PHB18             16                         8
P2-C9           PHB19             18                         9
P2-C10          PHB20             20                         10
(The 20 slots are served by six I/O chips, I/O chip 1-3 on planar P1 and I/O chip 4-6 on planar P2; the per-slot I/O chip column is not reproduced here – see the Adapter Placement guide.)
The 5803 I/O drawer has two I/O planar boards, and each planar has three I/O chips. Each I/O chip controls 3 or 4 PCI host bridges (PHBs) and each PCIe slot connects directly to a PHB.
148 Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
148
Updated
5803 I/O expansion unit internal diagram
Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
149
Updated
5803 brief considerations for adapter placement 1/2
Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
150
Updated
5803 brief considerations for adapter placement 2/2 § For availability, use redundant adapters from different I/O drawers. – FC adapters with Multi-Pathing software, usually 4-8 ports for business critical high workload partitions – Ethernet adapters, grouped in an EtherChannel configuration with either load-balancing or Network Interface Backup (NIB) mode
§ Avoid using adapter slots P1-C7—C10 for Extra High Bandwidth adapters. § Spread the use of Extra High Bandwidth adapters over as many I/O chip as possible. § Intersperse Extra High Bandwidth adapters on the same I/O chip with lower bandwidth adapters. – If the Extra High Bandwidth adapter is for business critical partition and the other adapters on a I/O chip are for noncritical partitions, consider placing the Extra High Bandwidth adapter in the highest priority slot – see “Adapter Placement Guide”. – If saturating the I/O chip with one Extra High Bandwidth adapter, and need to use other adapters on the same I/O chip, consider placing the Extra High Bandwidth adapter in a lower priority slot than a lower utilized adapter. – If highly utilized Extra High Bandwidth adapters, spread and if not sufficient to maintain peak load add additional I/O drawers and relocate adapters.
§ If using 8Gbit FC adapters (5735) consider enabling both ports, and replace one port with another adapter if the adapter I/O is saturating the I/O chip and preventing other adapters from performing properly – see also recommendation regarding the fcstat command statistics for “No Adapter Element Count”. § If using 10 Gigabit Ethernet adapter, use default setting and enable large send/receive functionality, use switch and adapter flow control, consider enabling Jumbo Frames for higher network frame payload, monitor buffer pool utilization with the netstat and entstat commands.
Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
151
Updated
IBM Power 780 (9179-MH[BC]) Adapter Placement Guide
Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
152
Updated
Brief considerations for adapter enclosure placement
Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
153
Updated
Power 9179-MHD Quad Socket Planar
Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
154
Updated
Network Tracing introduction
© Copyright IBM Corporation 2015
Updated
Network IP trace § Ensure a file system with at least 1GB of free space is available (for 10-20 min of traffic) § Start IP trace, and preferably limit what is traced if possible, such as interface and protocol (and possibly source/destination hosts): – startsrc -s iptrace -a "-i enX -P tcp nettrcf_raw"
§ Stop IP trace: – stopsrc -s iptrace
Example throughput graph illustrating a problem using GUI
§ Can create a text report: – ipreport -v nettrcf_raw > nettrcf_report
§ Can use the open source Wireshark GUI tool from – http://www.wireshark.org/download.html
§ Can use the open source Wireshark command line tool tshark, such as: – tshark.exe -R "tcp.len>1448“ -r nettrcf_raw
Example illustrating a problem using tshark
… 1005 122.895749299 10.1.1.13 -> 10.1.1.17 TCP 18890 50770 > 5001 [ACK] Seq=35742433 Ack=1 Win=32761 Len=18824 TSval=1335798940 TSecr=1334065961 1009 122.896252205 10.1.1.13 -> 10.1.1.17 TCP 23234 [TCP Previous segment lost] 50770 > 5001 [ACK] Seq=35956737 Ack=1 Win=32761 Len=23168 TSval=1335798940 TSecr=1334065961 … 156 © Copyright IBM Corporation 2015 Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
Updated
Wireshark Graphical user interface (screenshot callouts): Command menus, Filter specification, List of captured packets, Details of selected packet header, Packet content in hexadecimal and ASCII
Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
157
Updated
Network IP trace analysis using Wireshark
Note protocol issues during workload tests – Sampled using perfPMR iptrace – TCP segments lost – UDP checksum issues – IP checksum issues
1. Wireshark > Open iptrace
2. Analysis > Expert Info Composite
Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
158
Updated
IP Trace Findings
§ Only enable tcp_nodelayack if the actual network traffic workload requires it
§ This sampled iptrace data indicates that tcp_nodelayack is not required for this case
– The response from the DB (port 1531 - mapped as rap-listen) is almost immediate (in <= 20 milliseconds)
– With tcp_nodelayack enabled (tcp_nodelayack=1), the receiver sends an ACK to the sender after every received packet, which results in an additional network packet for each packet with data transferred and significantly increases the network load.
– If disabled (default), the ACK is sent with the response to the sender, or at the latest after a delay of up to 200ms (default).
– If multiple partitions have packet rates that put the load on the VIOS SEA above 100K packets/s, enabling tcp_nodelayack for all partitions will basically double the packet rate to ~200K packets/s.
  • If the traffic bridges over SEAs between hosts, the network stack will carry an unnecessarily high traffic volume (see the sketch below for checking and setting the tunable).
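The tunable itself is checked and changed with the standard AIX no command; a minimal sketch:
# Check the current setting (0 = delayed ACKs, the default)
no -o tcp_nodelayack
# Disable nodelayack now and persistently across reboots
no -p -o tcp_nodelayack=0
# Review related TCP tunables while investigating
no -a | egrep "tcp_nodelayack|tcp_sendspace|tcp_recvspace|rfc1323"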
Filter used - (ip.addr eq 172.30.1.60 and ip.addr eq 172.30.1.80) and (tcp.port eq 50838 and tcp.port eq 1531) Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
159
Updated
IP Trace Findings
§ Only enable tcp_nodelayack if the actual network traffic workload requires it
§ This sampled iptrace data indicates that tcp_nodelayack is not required for this case
– We notice the "request" packet arrives at 10.2.50.1, which responds immediately to most of them (in less than 1 millisecond most of the time).
– But every now and then, 10.2.50.1 takes a few seconds to respond to a few requests.
– For example, a request (packet #14618) arrives, but the response goes out only after ~3.7s, which is why a delayed ACK to the request is sent after 150 milliseconds.
– So even if an ACK (packet #15646) were sent immediately, without waiting 148 milliseconds, it would not help performance here, since the response still goes out only after ~3.7s.
– Once the connection is established, the request/response exchange is fast most of the time (though we notice cases where the time delta between requests is around a second, and at times the response takes 200 ms, etc.); the per-packet timing can be extracted as sketched below.
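One way (not from the original material) to pull per-packet timing for such a conversation out of the raw trace with tshark; the trace file name is the example used earlier, and the addresses and ports are the ones from the filter below:
tshark -r nettrcf_raw \
  -Y "(ip.addr==10.2.50.132 and ip.addr==10.2.50.1) and (tcp.port==4295 and tcp.port==1521)" \
  -T fields -e frame.number -e frame.time_relative -e ip.src -e ip.dst -e tcp.len
The relative timestamps make the ~3.7s gaps between a request and its response easy to spot.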
Filter used: (ip.addr eq 10.2.50.132 and ip.addr eq 10.2.50.1) and (tcp.port eq 4295 and tcp.port eq 1521) Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
160
Updated
Network throughput simple tests (1/3)
§ FTP
– Simple single stream throughput using actual socket settings.
– This test reads from the special file /dev/zero, which is unlimited in size and provides all-zero data; on the destination server the data is written to the special data sink file /dev/null, which just discards the data and does not store it (null operation).
– Assuming ftpd is started on the receiver side (default ports 20 & 21), start the sender with the ftp command, such as the commands below for 1.0 GB of data transferred by FTP (TCP protocol) in 1MB blocks 1000 times; the block size and count can also be varied, and .netrc with macdef init can be used:
ftp <FQDN/IP address of receiver>
bin
put "|dd if=/dev/zero bs=1000k count=1000" /dev/null
bye
§ IPERF
– IPERF is an open source tool for measuring maximum TCP and UDP bandwidth performance; it allows tuning of various parameters and characteristics, and reports bandwidth, delay jitter, and datagram loss.
– Socket-specific TCP send and receive buffers can be specified (the -u option selects the UDP protocol)
  • https://code.google.com/p/iperf/ - IPERF3 (BSD license)
  • http://sourceforge.net/projects/iperf/ - IPERF2 (BSD-like open source licence)
  • http://www.perzl.org/aix/index.php?n=Main.iperf - RPM for IPERF iperf-2.0.5-1.aix5.1.ppc.rpm
  • http://www.oss4aix.org/download/RPMS/iperf/ - RPM for IPERF 2.0.2
  • https://code.google.com/p/iperf/wiki/ManPage & http://openmaniak.com/iperf.php
– Start the receiver (default port 5001), and then the sender (see the run sketch below):
  1. iperf -s -w 786432
  2. iperf -w 786432 -P <#VPs> -c <FQDN/IP address of receiver>
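A minimal run sketch with iperf2, assuming the RPM above is installed on both partitions and lpar3 is the receiver (hostname illustrative):
# Receiver side (lpar3), 768KB socket buffer
iperf -s -w 786432
# Sender side: one stream per virtual processor (here 8), 60s run, report every 10s
iperf -c lpar3 -w 786432 -P 8 -t 60 -i 10
# Optional UDP test at a target bandwidth, reporting jitter and datagram loss
iperf -c lpar3 -u -b 1000M -t 60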
Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
161
Updated
Network throughput simple tests (2/3)
§ FTP ~/.netrc file example – adjust count and block size to run 1-3
machine FQDN-IPADDRESS login root password PASSWORD
macdef init
bin
put "|dd if=/dev/zero bs=1400 count=500000" /dev/null
put "|dd if=/dev/zero bs=1400 count=500000" /dev/null
put "|dd if=/dev/zero bs=1400 count=500000" /dev/null
put "|dd if=/dev/zero bs=4096 count=170898" /dev/null
put "|dd if=/dev/zero bs=4096 count=170898" /dev/null
put "|dd if=/dev/zero bs=4096 count=170898" /dev/null
put "|dd if=/dev/zero bs=8192 count=85449" /dev/null
put "|dd if=/dev/zero bs=8192 count=85449" /dev/null
put "|dd if=/dev/zero bs=8192 count=85449" /dev/null
put "|dd if=/dev/zero bs=16384 count=42725" /dev/null
put "|dd if=/dev/zero bs=16384 count=42725" /dev/null
put "|dd if=/dev/zero bs=16384 count=42725" /dev/null
put "|dd if=/dev/zero bs=30720 count=22786" /dev/null
put "|dd if=/dev/zero bs=30720 count=22786" /dev/null
put "|dd if=/dev/zero bs=30720 count=22786" /dev/null
put "|dd if=/dev/zero bs=61440 count=11393" /dev/null
put "|dd if=/dev/zero bs=61440 count=11393" /dev/null
put "|dd if=/dev/zero bs=61440 count=11393" /dev/null
put "|dd if=/dev/zero bs=122880 count=5697" /dev/null
put "|dd if=/dev/zero bs=122880 count=5697" /dev/null
put "|dd if=/dev/zero bs=122880 count=5697" /dev/null
bye
Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
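A usage note for the macro above (standard ftp auto-login behaviour; the hostname is illustrative): the init macro runs automatically after auto-login when the machine entry matches, and the macro definition must be terminated by a blank line in ~/.netrc. Because the file contains a password, ftp refuses auto-login unless the file is readable by the owner only:
# Restrict permissions, then run the whole transfer series against the receiver and time it
chmod 600 ~/.netrc
time ftp lpar3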
© Copyright IBM Corporation 2015
162
Updated
Network throughput simple tests (3/3)

     To lpar3 from lpar2                       To lpar2 from lpar3         SPOT CHECK
seq  Duration(s)  Kbytes/s  size    count      Kbytes/s  size    count     diff/s duration  diff/s kbps
 1   2.17         2535      1024    5500       7094      1024    5500       1.3947          -4559
 2   2.708        2031      1024    5500       6465      1024    5500       1.8573          -4434
 3   2.436        2258      1024    5500       5270      1024    5500       1.392           -3012
 4   4.932        2027      4096    2500       7452      4096    2500       3.59            -5425
 5   5.511        1814      4096    2500       2414      4096    2500       1.369           -600
 6   6.003        1666      4096    2500       6330      4096    2500       4.423           -4664
 7   3.395        2828      8192    1200       7676      8192    1200       2.144           -4848
 8   3.215        2986      8192    1200       7898      8192    1200       2               -4912
 9   6.235        1540      8192    1200       7500      8192    1200       4.955           -5960
10   5.53         2893      16384   1000       6073      16384   1000       2.895           -3180
11   9.653        1658      16384   1000       2761      16384   1000       3.859           -1103
12   5.456        2932      16384   1000       4362      16384   1000       1.788           -1430
13   9.178        2896      30720   886        2801      30720   886       -0.312            95
14   11.23        2366      30720   886        7310      30720   886        7.594           -4944
15   10.45        2543      30720   886        3945      30720   886        3.713           -1402
16   12.79        2830      61440   603        7510      61440   603        7.973           -4680
17   13.11        2759      61440   603        3254      61440   603        1.99            -495
18   21.7         1668      61440   603        6811      61440   603       16.388           -5143
19   29.8         1840      122880  457        3832      122880  457       15.49            -1992
20   20.27        2705      122880  457        3650      122880  457        5.25            -945
21   17.72        3094      122880  457        7181      122880  457       10.083           -4087

SPOT CHECK notes:
– Run FTP in both directions.
– If significant difference, check also with traceroute and ping -R in both directions.
– Check tcp buffers and congestion window (use IPERF or equivalent to alter buffer sizes during testing).
– Check iptrace.
– Check network cables, switches, routers, firewalls and interlinks.

Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
163
Updated
DPO Simulation
© Copyright IBM Corporation 2015
Updated
Using DPO
Reference Anandakumar Mohan
§ Use case with Brokerage Simulation application and database – Simulation by IBM Lab Services ISA team
§ Workload with 70 Job-sets – – – –
Starting with I/O intensive database restore Thereafter transactions for Customer Exchange (CE) and Market Exchange (ME) Partition configured with 384GB memory Two runs with same data and settings before and after DPO defragmentation of the partition
Before DPO • Lower throughput, higher response times
Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
After DPO • Throughput increased by ~20% • Response time reduced for all transaction types
© Copyright IBM Corporation 2015
165
Updated
Before optimization – lssrad and lsmemopt
Partition is placed across 3 nodes
Most of the CPU and memory are from 2 nodes DPO Current score is 58 DPO Predicted score is 99
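The placement and scores shown here come from lssrad on AIX and lsmemopt/optmem on the HMC; a minimal command sketch, with the managed system name Server-795 used for illustration:
# On the AIX partition: how logical CPUs and memory are spread across affinity domains
lssrad -av
# On the HMC: current and predicted (calculated) affinity scores for the managed system
lsmemopt -m Server-795 -o currscore
lsmemopt -m Server-795 -o calcscore
# Start a Dynamic Platform Optimizer run and monitor its progress
optmem -m Server-795 -o start -t affinity
lsmemopt -m Server-795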
Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
166
Updated
After optimization – lssrad and lsmemopt
The DPO operation in this case took ~1½ minutes to complete, with the partition active but idle.
APAR IV42662: LSSRAD SHOWS SMALL INCONSISTENCIES IN MEMORY REPORTING AFTER DPO
If a partition is shut down, DPO completes fastest. If the server has free memory, such as more memory installed than activated, the hypervisor can use this memory for the rearrangement.
Most of the CPU and memory are now contained in one node after DPO
DPO Current score is now 95 DPO Predicted score is now 98
Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
167
Updated
System sw & Firmware
© Copyright IBM Corporation 2015
Updated
AIX partition software level consistency check using niminv
/usr/sbin/niminv -o invcmp -a targets='aix13,aix19' -a base='aix13' -a location='/tmp/123'
Comparison of aix13 to aix13:aix19 saved to /tmp/123/comparison.aix13.aix13:aix19.120426230401.
Return Status = SUCCESS
cat /tmp/123/comparison.aix13.aix13:aix19.120426230401
name                                      base        1           2
----------------------------------------- ----------  ----------  ----------
AIX-rpm-7.1.0.1-1                         7.1.0.1-1   same        same
...
bos.64bit                                 7.1.0.1     same        same
bos.acct                                  7.1.0.0     same        same
bos.adt.base                              7.1.0.0     same        same
bos.adt.include                           7.1.0.1     same        7.1.0.0
bos.adt.lib                               7.1.0.0     same        same
...
bos.rte                                   7.1.0.1     same        same
...
base  = comparison base = aix13
1     = aix13
2     = aix19
'-'   = name not in system or resource
same  = name at same level in system or resource

Differences are shown for all non matching VPD components
Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
169
Updated
IBM Power Systems System Firmware (Microcode)
§ Keeping firmware current will help in attaining the maximum reliability and functionality from your systems.
§ Release Levels
– Stay on Release Levels that are supported via Service Packs to continue to receive firmware fixes.
– New Release Levels are targeted to be released twice a year.
– Upgrading from one Release Level to another will always be disruptive.
§ Service Packs
– The first Service Pack will generally be released approximately six to eight weeks following a Release Level and then subsequently at 3 to 4-month intervals.
– Updates to Service Packs within the same Release Level can be performed concurrently.
For more information on Firmware Service Strategies and Best Practices: http://www14.software.ibm.com/webapp/set2/sas/f/best/IBMPowerSystemsFirmware_Best_Practices_v6.pdf Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
170
Updated
Microcode Discovery Service (MDS) How to use MDS: 1. Download the most current microcode catalog.mic text file 2. Replace the catalog.mic text file on each AIX/VIOS partition 3. Run the invscout command on each AIX/VIOS partition 4. Collect the <PartitionName>.mup file from each AIX/VIOS partition 5. Concatenate all generated .mup files into one file (such as all.mup) 6. Upload the concatenated all.mup file to the MDS website
For more information on Microcode Discovery Service (MDS): http://www14.software.ibm.com/webapp/set2/mds/ Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
171
Updated
Microcode Discovery Service (MDS)
§ Microcode Discovery Service and the invscout command
– The MDS website:
  • http://www14.software.ibm.com/webapp/set2/mds/
– Use the AIX invscout command to check the currently installed microcode levels on physical hardware
  • http://publib.boulder.ibm.com/infocenter/aix/v6r1/topic/com.ibm.aix.cmds/doc/aixcmds3/invscout.htm
§ How to run MDS
1. Select partitions that have physical adapters
2. Download the most current microcode catalog text file:
   • http://techsupport.services.ibm.com/server/mdownload/catalog.mic
3. As root, replace /var/adm/invscout/microcode/catalog.mic with the downloaded file
4. As root, run the command invscout
   • The invscout command will create the /var/adm/invscout/HOSTNAME.mup file
5. To generate a report with recommendations for the currently installed microcode levels, upload the /var/adm/invscout/HOSTNAME.mup file to the IBM MDS website:
   • http://www14.software.ibm.com/webapp/set2/mds/fetch?page=mdsUpload.html
– Note: Concatenate several .mup files into one file, and upload once to create a consolidated report (a shell sketch of these steps follows below).
– Also:
  1. Before installing any microcode, be sure to review each README file AND follow the instructions, and if necessary schedule a service window.
  2. Always use the latest microcode catalog file; if not, the MDS report will contain a warning.
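A shell sketch of the per-partition collection steps above; the catalog file location and the use of a single concatenated all.mup file follow the slides, while the working directory is illustrative:
# On each AIX/VIOS partition: refresh the catalog and run the survey
cp catalog.mic /var/adm/invscout/microcode/catalog.mic
invscout
ls -l /var/adm/invscout/$(hostname).mup
# On one collection host: concatenate all gathered .mup files and upload the result once to the MDS website
cat *.mup > all.mup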
Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
172
Updated
MDS report (one server sample)
Severity definitions
HIPER – High Impact/PERvasive. Should be installed as soon as possible.
SPE – SPEcial Attention. Should be installed at earliest convenience. Fixes for low potential high impact problems.
ATT – ATTention. Should be installed at earliest convenience. Fixes for low potential low to medium impact problems.
PE – Programming Error. Can install when convenient. Fixes minor problems.
New – New Firmware Release level for a product.
(Report header: Server name, Serial Number, IP address, etc.)
Impact statement
Availability – Fixes that improve the availability of resources.
Data – Fixes that resolve customer data errors.
Function – Fixes that add or affect system or machine operation related to features, connectivity or resource.
Security – Fixes that improve or resolve security issues.
Serviceability – Fixes that influence problem determination or fault isolation and maintenance related to diagnostic errors etc.
Performance – Fixes that improve or resolve throughput or response times.
…lines omitted…
For Impact, Severity and other Firmware definitions, Please refer to the 'Glossary of firmware terms' url: http://www14.software.ibm.com/webapp/set2/sas/f/power5cm/home.html#termdefs Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
173
Updated
FLRT – FLRT Lite
Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
174
Updated
POWER7 High-End System Firmware History
Power7 High-End System Firmware Fix History - Release levels AH760
http://download.boulder.ibm.com/ibmdl/pub/software/server/firmware/AH-Firmware-Hist.html

AH760_068_043 / FW760.30 06/24/13 – Impact: Availability, Severity: SPE
• …
AH760_062_043 / FW760.20 02/27/13 – Impact: Availability, Severity: SPE
• …
AH760_043_043 11/21/12 – Impact: New, Severity: New
New Features and Functions
• …
• Support for 0.05 processor granularity.
• Support for 64GB DIMMs.
• Support for Dynamic Platform Optimizer (DPO).
• The Hypervisor was enhanced to enforce broadcast storm prevention between the primary and backup SEAs (Shared Ethernet Adapters). This fix requires VIOS 2.2.2.0 or later on all VIOS partitions with SEA devices.
Additional Requirements:
• FC EB33, available at no charge, needs to be ordered for DPO
• Partitions included in DPO optimization need to be running an affinity aware version of the operating system OR need to be restarted after DPO completes. If not, partitions can be excluded from participation in optimization through a command line option on the optmem command.
Notes:
– Affinity aware operating system (OS) levels that support DPO:
  ◦ AIX 6.1 TL8 or later
  ◦ AIX 7.1 TL2 or later
  ◦ VIOS 2.2.2.0
  ◦ IBM i 7.1 PTF MF56058
– No integrated support for DPO in current RHEL or SUSE Enterprise versions. Linux partitions can either be excluded from participation in optimization or restarted after the DPO operation completes.

For more information on Firmware Description and History:
– Low-End: ftp://ftp.boulder.ibm.com/software/server/firmware/AL-Firmware-Hist.html
– Mid-range: ftp://ftp.boulder.ibm.com/software/server/firmware/AM-Firmware-Hist.html
– High-End: ftp://ftp.boulder.ibm.com/software/server/firmware/AH-Firmware-Hist.html
Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
175
Updated
Adapter Firmware
Severity definitions
HIPER – High Impact/PERvasive. Should be installed as soon as possible.
SPE – SPEcial Attention. Should be installed at earliest convenience. Fixes for low potential high impact problems.
ATT – ATTention. Should be installed at earliest convenience. Fixes for low potential low to medium impact problems.
PE – Programming Error. Can install when convenient. Fixes minor problems.
New – New Firmware Release level for a product.
For more information on Adapter Firmware: http://www14.software.ibm.com/webapp/set2/firmware/lgjsn?mode=44&mtm=9119-FHB Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
176
Updated
Subscribe to IBM bulletins and notifications
§ Regularly review IBM bulletins for software update advisories, High Impact and security issues
– My notifications > System p bulletins
– FLASHES
– TECHNOTES
– APAR update subscription
Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
177
Updated
My notifications > System p bulletins
http://www14.software.ibm.com/webapp/set2/subscriptions/ijhifoe?mode=1&prefsOnOff=null &heading=AIX71&topic=TL00&month=ALL
See My notifications web site for information bulletins: http://www14.software.ibm.com/webapp/set2/subscriptions/pqvcmjd Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
178
Updated
My notifications > Subscribe
Select Power and System p/i but also separately Select Other software
See My notifications web site for information bulletins: https://www-947.ibm.com/systems/support/myview/subscription/css.wss/folders?methodName=listMyFolders Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
179
Updated
My notifications > Subscribe > Other Software
Other Software > AIX > Continue
Select Document types to subscribe to
Note: all document types listed may not be available for all products.
https://www-947.ibm.com/systems/support/myview/subscription/css.wss/subscriptions#help-2
Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
180
Updated
APAR update subscription
How to subscribe to APAR updates
The following options can be used to subscribe for APAR update notification through My Support:
1. Troubleshooting or APAR category option
   • This option provides notification of new APARs only. When you create a subscription for a product, select Troubleshooting or APAR (Authorized Program Analysis Reports) on the Subscribe tab "Document types".
   • If the APAR option is not available for your product, you can subscribe by using the APAR or Component ID option.
2. APAR or Component ID option
   • After you find a specific APAR document, you can select a subscription option from the Subscribe to this APAR section at the top of the APAR page.
   • Subscription options can include one or both of these selections:
     • Notify me when this APAR changes – Notifications are based on a specific APAR. This option is only available if the APAR is "open" or "closed" with a fix pending. You are notified as the APAR progresses through its lifecycle.
     • Notify me when an APAR for this component changes – Notification is based on a component ID and includes all APARs associated with the selected component. If you need to track changes to every APAR for your product, use this option. You are notified as each APAR progresses through its lifecycle, including when a PTF becomes available.
Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
181
Updated
Some documentation
© Copyright IBM Corporation 2015
Updated
POWER7 Virtualization Best Practice guide
§ Table of Content
– Virtual Processors
  – Sizing/configuring virtual processors
  – Entitlement vs. Virtual processors
  – Matching entitlement of a LPAR close to its average utilization for better performance
  – When to add additional virtual processors
  – How to estimate the number of virtual processors per uncapped shared LPAR
  – Virtual Processor Management - Processor Folding
  – Processor Bindings in Shared LPAR
– Recommendation on page table size for LPAR
– Recommendation for placing LPAR resources to attain higher memory affinity
  – What does the SPPL option do on a Power 795 system
  – How to determine if a LPAR is contained within a book
  – Forcing critical LPAR to get the best resource assignment
  – Affinity Groups
  – PowerVM resource consumption for capacity planning considerations
  – Licensing resources (CuOD)
https://www.ibm.com/developerworks/wikis/download/attachments/53871915/P7_virtualization_bestpractice.doc Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
183
Updated
AIX on Power – Performance FAQ
§ Table of Content
– What Is Performance
– Performance Benchmarks
– Workload Estimation and Sizing
– Performance Concepts
– Performance Analysis and Tuning Process
– Performance Analysis How-To
– Frequently Asked Questions
– POWER7
– Java
– Reporting a Performance Problem
http://www-01.ibm.com/common/ssi/cgi-bin/ssialias?infotype=SA&subtype=WH&htmlfid=POW03049USEN Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
184
Updated
POWER7 Optimization and Tuning Guide
§ Table of Content – Chapter 1. Optimization and tuning on the IBM POWER7 and IBM POWER7+ – Chapter 2. The POWER7 processor – Chapter 3. The POWER Hypervisor – Chapter 4. AIX – Chapter 5. Linux – Chapter 6. Compilers and optimization tools for C/C++/FORTRAN – Chapter 7. JAVA – Chapter 8. DB2 – Chapter 9. WebSphere Application Server – Appendix A. The AIX malloc cookbook – Appendix B. Performance tooling and empirical performance analysis – Appendix C. POWER7 optimization and tuning with third-party applications
http://www.redbooks.ibm.com/abstracts/sg248079.html Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
185
Updated
IBM Power Systems Performance Guide: Implementing and Optimizing § Table of Content – Chapter 1. IBM Power Systems and performance tuning – Chapter 2. Hardware implementation and LPAR planning – Chapter 3. IBM Power Systems virtualization – Chapter 4. Optimization of an IBM AIX operating system – Chapter 5. Testing the environment – Chapter 6. Application optimization – Appendix A. Performance monitoring tools – Appendix B. New commands and new commands flags – Appendix C. Workloads
http://www.redbooks.ibm.com/abstracts/sg248080.html Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
186
Updated
AIX Performance management and tuning
IBM AIX Dynamic System Optimizer
The IBM® AIX Dynamic System Optimizer is a stand-alone feature for the AIX operating system that automatically adjusts some settings to maximize the efficiency of your system.
Performance management
You can use this information to complete tasks such as assessing and tuning the performance of processors, file systems, memory, disk I/O, NFS, JAVA, and communications I/O. The topics also address efficient system and application design, including their implementation. This topic is also available on the documentation CD that is shipped with the operating system.
Performance Tools Guide and Reference
The performance of a computer system is based on human expectations and the ability of the computer system to fulfill these expectations. The objective for performance tuning is to make those expectations and their fulfillment match. The path to achieving this objective is a balance between appropriate expectations and optimizing the available system resources. The performance-tuning process demands great skill, knowledge, and experience, and cannot be performed by only analyzing statistics, graphs, and figures.
http://pic.dhe.ibm.com/infocenter/aix/v7r1/topic/com.ibm.aix.doc/doc/base/performance.htm Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
187
Updated
Technical Note from IBM Oracle International Competency Center (ICC)
IBM POWER7 AIX and Oracle Database performance considerations, March 21, 2014
§ Table of Content
– Introduction
  – Table 1: Suggested Power and Oracle considerations with URL links to important documents
– Oracle DB 11gR2 standard practices for IBM AIX
  – Memory
  – CPU
  – I/O
  – Network
  – Miscellaneous
– AIX fixes for Oracle 10.2.0.4 and 11gR2
– Oracle patches to check in the context of AIX
– Recent suggestions and open issues
http://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/WP102171 Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
188
Updated
Oracle Architecture and Tuning on AIX v2.30
§ Table of Content
– Oracle Database Architecture
  – Database structure
  – Instance and Application Processes
  – Oracle Memory Structures
– AIX Configuration & Tuning for Oracle
  – Overview of AIX VMM
  – Memory and Paging
  – I/O Configuration
  – CPU Tuning
  – Network Tuning
– Oracle Tuning
http://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/WP100883 Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
189
Updated
IBM Power 795 (9119-FHB) § The IBM Power 795 enterprise Power server offers leadership performance and massive scalability, along with the reliability, manageability and security features needed to consolidate AIX, IBM i, and Linux applications in the largest and most demanding data centre environments. § The IBM Power 795 server uses 64-bit POWER7 eight-core processor technology in up to 256-core configurations with processor-based symmetric multiprocessing (SMP) and PowerVM virtualization, with up to 20 micro-partitions per processor (1,000 maximum). – POWER® processor technology is an instruction-set architecture that spans applications from consumer electronics to supercomputers, and is built on an open architecture (http://www.power.org). For more information on IBM Power 795 and PowerVM: http://www-03.ibm.com/systems/power/hardware/795/index.html http://www.redbooks.ibm.com/abstracts/redp4640.html http://www-01.ibm.com/common/ssi/cgi-bin/ssialias?infotype=AN&subtype=CA&htmlfid=897/ENUS212-344 Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
190
Updated
IBM Power 780 (9179-MHB/9179-MHD) § The IBM Power 780 provides a unique combination of performance across multiple workloads and mainframe-inspired availability features to keep a business running, with the utmost in infrastructure efficiency at enterprise scale, such as for large-scale server consolidation and as a complete business system combining all aspects of a company's IT infrastructure. – The IBM Power 780 is a high-end POWER7 processor-based symmetric multiprocessing (SMP) system.
For more information on IBM Power 780/770: http://www-03.ibm.com/systems/power/hardware/780/ http://www.redbooks.ibm.com/abstracts/redp4639.html http://www.redbooks.ibm.com/redpieces/abstracts/redp4924.html Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
191
Updated
IBM Systems Lab Services and Training
Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
© Copyright IBM Corporation 2015
192