Björn Rodén (roden@ae.ibm.com) Prerna Awasthi (prerawas@in.ibm.com) http://www.ibm.com/systems/services/labservices/ http://www.ibm.com/systems/power/support/powercare/
Thanks to:
Sivakumar K, Niranjan S, Herman D, Kiet L, Gupa V, Zhi-wei D, et al
Introduction to network analysis with Wireshark open source packet analyzer
© Copyright IBM Corporation 2015
Technical University/Symposia materials may not be reproduced in whole or in part without the prior written permission of IBM.
Session Objectives
• This session introduces how to analyze network issues using iptrace and the Wireshark open source packet analyzer
– Overview
– Some customer examples
– Additional examples identifying common network issues
You will learn how to approach network analysis with iptrace and wireshark
System Levels / Perspectives
Approach to analyze network issues using iptrace
• Determine the network issue at a very high level. Examples include:
– response is slow
– intermittent delays
– application is fine but excessive network load reported by network team
• Since iptrace is resource intensive, narrow the scope of the issue by:
– Identifying the time (in terms of minutes) when the issue is occurring
– Identifying the interface that is being used by the application that is facing the issue
– Identifying the destination server with which the connection is established (in scenarios such as database connectivity issues)
– Identifying the protocols being used
• Run the iptrace command with appropriate options
• Feed the binary trace file to ipreport or wireshark (tshark) for analysis
TCP/IP Basics
• Internet Protocol (IP) is the principal communications protocol used for host-to-host datagram service across an internetwork using the TCP/IP protocols.
– Internet Control Message Protocol (ICMP) is a protocol used for messages with diagnostic or routing purposes.
– Address Resolution Protocol (ARP) is a protocol used for resolution of network layer addresses into link layer addresses.
• TCP and UDP are Transport Layer protocols
– Transmission Control Protocol (TCP) is a connection-oriented protocol for data transmission
  • Attributes: Reliable, Ordered, Heavyweight
  • The protocol is responsible for proper data delivery
  • Connections are established using a three-way handshake
– User Datagram Protocol (UDP) is a connectionless protocol for data transmission
  • Attributes: Unreliable, Unordered, Lightweight
  • Applications are responsible for proper data delivery
  • No connection is set up; communication starts by simply transmitting data
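The difference between the two transport protocols can be illustrated with a minimal Python sketch over the loopback interface. It is not part of the original material; the function names are hypothetical and it assumes 127.0.0.1 is available on the machine running it. With UDP the first packet already carries data, while with TCP the connect call completes the three-way handshake before any data is sent.

```python
import socket


def udp_roundtrip(payload: bytes) -> bytes:
    """Send one datagram over loopback; no connection setup takes place."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as receiver, \
            socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sender:
        receiver.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        receiver.settimeout(2.0)
        # The very first packet on the wire already contains the payload.
        sender.sendto(payload, receiver.getsockname())
        data, _addr = receiver.recvfrom(4096)
        return data


def tcp_roundtrip(payload: bytes) -> bytes:
    """Send a small payload over loopback after a TCP connection is set up."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))
        listener.listen(1)
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as client:
            client.settimeout(2.0)
            # connect() returns once the SYN, SYN/ACK, ACK exchange is done.
            client.connect(listener.getsockname())
            server, _addr = listener.accept()
            with server:
                server.settimeout(2.0)
                client.sendall(payload)
                return server.recv(4096)


if __name__ == "__main__":
    print(udp_roundtrip(b"hello over udp"))
    print(tcp_roundtrip(b"hello over tcp"))
```

Tracing the loopback traffic of such a test would show three extra packets for TCP before the payload, and none for UDP.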
iptrace
• iptrace is a network packet tracing and monitoring command on AIX
• iptrace records trace data into a binary file that can be formatted by tools such as ipreport (on AIX) and wireshark (on Windows)
• iptrace provides interface-level packet tracing for Internet protocols
• iptrace records Internet packets received from configured interfaces
• iptrace flags provide a filter so that the daemon traces only packets meeting specific criteria such as
– Port
– Protocol
– Host
– Interface
• iptrace uses the net_xmit_trace kernel service for tracing
• The iptrace command can run either as a daemon or under the System Resource Controller (SRC)
How to run iptrace
There are two ways to invoke iptrace:
1. Using the System Resource Controller:
To start iptrace: startsrc -s iptrace -a "<iptrace_options>"
Example: startsrc -s iptrace -a "-i en0 -p telnet -s testhost /tmp/iptrace"
Records packets on interface en0 that use the telnet port and come from host testhost into the file /tmp/iptrace
To stop iptrace: stopsrc -s iptrace
2. Direct command line:
To start iptrace: iptrace <options>
Example: iptrace -i en0 -p telnet -s testhost /tmp/iptrace
To stop iptrace: kill -15 <iptrace_PID>
NOTE: If iptrace is stopped using kill -9, then iptrace -u has to be used to unload the kernel extensions loaded by iptrace
iptrace command options
Because network tracing can produce large amounts of data, it is important to limit the network trace either by scope (what to trace) or amount (how much to trace).
iptrace command options help limit the scope and hence the system resources consumed by iptrace.
Syntax
/usr/sbin/iptrace [ -a ] [ -b ] [ -e ] [ -u ] [ -PProtocol_list ] [ -iInterface ] [ -pPort_list ] [ -sHost [ -b ] ] [ -dHost ] [ -L Log_size ] [ -B ] [ -T ] [ -S snap_length ] LogFile
Some flags for the iptrace command:
-a: Suppresses ARP packets
-i: Records packets received on the interface specified
-P: Records packets that use the protocol specified
-d: Records packets headed for the destination host
-s: Records packets coming from the source host
-b: Changes the -d or -s flags to bidirectional mode
Protocols supported by iptrace: ip, icmp, ggp, tcp, udp, pup
For more information on iptrace: http://www-01.ibm.com/support/knowledgecenter/ssw_aix_71/com.ibm.aix.cmds3/iptrace.htm
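As an illustration of limiting the trace scope, the following Python sketch assembles an iptrace option list from the flags described above (-a, -i, -P, -s, -d, -b) and wraps it in the startsrc form used in this deck. The helper names iptrace_args and startsrc_command are hypothetical, and the sketch only builds strings; it does not run iptrace.

```python
def iptrace_args(logfile, interface=None, protocol=None,
                 src_host=None, dst_host=None,
                 bidirectional=False, suppress_arp=False):
    """Build an iptrace argument list from a small set of scope filters."""
    args = []
    if suppress_arp:
        args.append("-a")            # suppress ARP packets
    if interface:
        args += ["-i", interface]    # only this interface
    if protocol:
        args += ["-P", protocol]     # only this protocol
    if src_host:
        args += ["-s", src_host]     # packets coming from this host
    if dst_host:
        args += ["-d", dst_host]     # packets headed for this host
    if bidirectional:
        args.append("-b")            # make -s / -d bidirectional
    args.append(logfile)             # the log file is always the last argument
    return args


def startsrc_command(args):
    """Wrap the argument list in the SRC invocation form."""
    return 'startsrc -s iptrace -a "{}"'.format(" ".join(args))


if __name__ == "__main__":
    print(startsrc_command(
        iptrace_args("/tmp/iptrace", interface="en0", protocol="tcp")))
```

Keeping the filters in one place makes it easier to review how narrow a trace is before starting it on a busy system.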
tcpdump
The tcpdump command prints the headers of packets on a network interface that match the boolean expression, and can read input from a previous binary capture file or one network interface in promiscuous mode from a /dev/bpf* device. The matching expression can consist of single or combined primitives, and operators, which can be passed to tcpdump as either a single argument or as multiple arguments. Only packets that match expressions are processed by the tcpdump command.
– If it is not run with the -c flag, tcpdump continues capturing packets until it is interrupted by a SIGINT signal (typically control-C) or a SIGTERM signal (typically the kill(1) command). If tcpdump is run with the -c flag, it captures packets until it is interrupted by a SIGINT or SIGTERM signal or the specified number of packets have been processed.
Syntax
tcpdump [ -a ] [ -A ] [ -d ] [ -D ] [ -e ] [ -f ] [ -l ] [ -K ] [ -L ] [ -n ] [ -N ] [ -O ] [ -p ] [ -q ] [ -R ] [ -S ] [ -t ] [ -u ] [ -U ] [ -v ] [ -x ] [ -X ] [ -c count ] [ -C file_size ] [ -F file ] [ -G rotate_seconds ] [ -i interface ] [ -m module ] [ -M secret ] [ -r file ] [ -s snaplen ] [ -w file ] [ -E addr ] [ -y datalinktype ] [ -z command ] [ -Z user ] [ expression ]
Note: tcpdump consumes system resources once the dump is initiated, which may impact system performance.
For more information on tcpdump: http://www-01.ibm.com/support/knowledgecenter/ssw_aix_71/com.ibm.aix.cmds5/tcpdump.htm
ipreport
The ipreport command is a formatter for the trace log file generated by iptrace
Syntax
/usr/sbin/ipreport [ -e ] [ -r ] [ -n ] [ -s ] LogFile
/usr/sbin/ipreport [ -C ] [ -e ] [ -n ] [ -r ] [ -s ] [ -S ] [ -v ] [ -x ] [ -1 ] [ -N ] [ -T ] [ -c count ] [ -j pktnum ] [ -X bytes ] tracefile
Flags
-c count: Displays the number of packets.
-C: Validates checksum.
-e: Generates the trace report in EBCDIC format. The default format is ASCII.
-j pktnum: Jumps to the packet number specified by the pktnum variable.
-n: Includes a packet number to facilitate easy comparison of different output formats.
-N: Does not resolve the names.
-r: Decodes remote procedure call (RPC) packets.
-s: Prepends the protocol specification to every line in a packet.
-S: Generates the input file on a sniffer.
-T: Represents the input file in the tcpdump format.
-v: Verbose.
-x: Prints the packets in the hexadecimal format.
-X bytes: Limits the hexadecimal dumps to the value determined by the bytes variable.
For more information on ipreport: http://www-01.ibm.com/support/knowledgecenter/ssw_aix_71/com.ibm.aix.cmds3/ipreport.htm
Using Wireshark – all you need to know in one slide :)
Ensure a file system with at least 1GB free space is available (for 10-20min intensive traffic)
Start IP trace, and preferably limit what is traced if possible, such as interface and protocol (and possibly source/destination hosts):
– startsrc -s iptrace -a "-i enX -P tcp nettrcf_raw"
Stop IP trace:
– stopsrc -s iptrace
Can create a text report:
– ipreport -v nettrcf_raw > nettrcf_report
Can use the open source Wireshark GUI tool from
– http://www.wireshark.org/download.html
[Figure: Example throughput graph illustrating a problem using GUI]
Can use the open source Wireshark command line tool tshark, such as:
– tshark.exe -R "tcp.len>1448" -r nettrcf_raw
Example illustrating a problem using tshark
…
1005 122.895749299 10.1.1.13 -> 10.1.1.17 TCP 18890 50770 > 5001 [ACK] Seq=35742433 Ack=1 Win=32761 Len=18824 TSval=1335798940 TSecr=1334065961
1009 122.896252205 10.1.1.13 -> 10.1.1.17 TCP 23234 [TCP Previous segment lost] 50770 > 5001 [ACK] Seq=35956737 Ack=1 Win=32761 Len=23168 TSval=1335798940 TSecr=1334065961
…
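When the tshark output is long, the flagged packets can be picked out with a small script. The following Python sketch parses one-line summaries in the layout shown above (frame number, time, source, destination, protocol, length, info) and returns the frame numbers marked "[TCP Previous segment lost]". The regular expression and the function name are assumptions based only on the two sample lines; other tshark versions or column settings may print a different layout.

```python
import re

# Assumed column layout, taken from the sample lines on this slide:
# frame  time  src -> dst  protocol  length  info...
SUMMARY = re.compile(
    r"^\s*(?P<frame>\d+)\s+(?P<time>\d+\.\d+)\s+(?P<src>\S+)\s+->\s+"
    r"(?P<dst>\S+)\s+(?P<proto>\S+)\s+(?P<length>\d+)\s+(?P<info>.*)$"
)


def lost_segment_frames(lines):
    """Return frame numbers whose info column reports a lost previous segment."""
    frames = []
    for line in lines:
        match = SUMMARY.match(line)
        if match and "[TCP Previous segment lost]" in match.group("info"):
            frames.append(int(match.group("frame")))
    return frames


SAMPLE = [
    "1005 122.895749299 10.1.1.13 -> 10.1.1.17 TCP 18890 50770 > 5001 [ACK] "
    "Seq=35742433 Ack=1 Win=32761 Len=18824 TSval=1335798940 TSecr=1334065961",
    "1009 122.896252205 10.1.1.13 -> 10.1.1.17 TCP 23234 "
    "[TCP Previous segment lost] 50770 > 5001 [ACK] Seq=35956737 Ack=1 "
    "Win=32761 Len=23168 TSval=1335798940 TSecr=1334065961",
]

if __name__ == "__main__":
    print(lost_segment_frames(SAMPLE))
```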
Wireshark Overview
Wireshark open-source packet analyzer
Wireshark (used here on Windows) is a free and open-source packet analyzer.
A network packet analyzer captures network packets and displays the packet data in as much detail as possible.
Can be used for network analysis and troubleshooting
Has a graphical frontend, plus integrated sorting and filtering options.
Wireshark filters:
– The capture filter syntax is the same as the one used by programs using the pcap library (Berkeley Packet Filter), such as tcpdump, and must be set before launching the Wireshark capture.
– The display filters are used to search inside captured data obtained with a capture filter or from another program like iptrace and tcpdump. Their search capabilities are more extensive than those of the capture filter, and it is not necessary to restart the capture to change the filter.
With Wireshark there are also several command line tools, such as tshark (the CLI version of the Wireshark GUI) and dumpcap (the capture engine)
– tshark can be installed on PowerLinux running RHEL or SUSE
For more information on Wireshark and download:
Documentation – http://www.wireshark.org/download/docs/user-guide-a4.pdf
Download – http://www.wireshark.org/download.html
Wireshark Graphical user interface
– Command menus
– Filter Specification
– List of captured packets
– Details of selected packet header
– Packet content in hexadecimal and ASCII
Using Wireshark GUI - example
Analyze existing tcpdump / iptrace
Iptrace collected on the target system
Filtering Packets
• You may apply a filter by typing it into the filter box at the top of the window and clicking Apply (or pressing Enter).
– For example, type "tcp" and you'll see only TCP packets. When you start typing, Wireshark will help you auto-complete your filter.
No filter: all the packets are visible
Filter "tcp": after applying the filter, only TCP packets are shown
Create your own Filter
• Click the Analyze menu and select Display Filters to create a new filter
Create your own Filter - Example
• In the example below, a filter has been created to analyze the UDP packets which have a packet length of more than 100.
Exercise your Wiresharking using samples from the wild…
You can find a collection of strange and exotic sample capture files at:
– http://wiki.wireshark.org/SampleCaptures
Wireshark some customer cases
Customer case #1 – TCP segments lost
• Customer had a support case with IBM Support which got stuck in L2
• Customer had a couple of Power 780s, partitions with dedicated Ethernet Adapters (IVE)
• Symptoms:
– “…traffic is very slow and stops when RFC1323 is enabled…”
– “…when we disable RFC1323, the traffic flows but do not reach expected throughput…”
• We did:
– Simple workload test between two partitions with similar configuration as the production partitions which experienced the network flow issues
– Using iperf for tunable load profile
– Using iptrace to collect the network traffic from the sending partition
• Conclusion:
– With both LARGESEND and RFC1323 enabled on the network interface (enX), the traffic ground to a halt
– With either LARGESEND or RFC1323 disabled, the traffic flowed
• Opening the iptrace in Wireshark >>>>
– Most previous TCP segments were lost…
Network workload test with iperf iptrace and wireshark summary
Wireshark > Open iptrace
Analyze > Expert Info Composite
– Bad Checksum: 38764
– Previous segment lost: 22085
Analyze > TCP Stream Graph > Throughput Graph
[Figure annotation: STOP]
[Figure annotation: SPIKE ~123s]
Statistics > IO Graphs
Under Y Axis Units > Advanced >
– Graph 2: Calc: COUNT(*) frame.time_delta_displayed
– Graph 3: Calc: COUNT(*) ip.len
– Graph 4: Calc: COUNT(*) tcp.len
[Figure annotations: DROP & STOP, SPIKE, RESTART]
Statistics > IO Graphs, Graph 1
• X Axis: Tick Interval 0.1 sec, Pixels per tick 5
• Y Axis: Scale 20
Statistics > IO Graphs, under Y Axis Units > Advanced >
• X Axis: Tick Interval 0.1 sec, Pixels per tick 5
• Y Axis: Scale 20
Customer case #2 – Delay sending TCP ACK
• Customer has performance issues with their interlocking banking systems and branches
• Customer has various Power Systems, from 795s to 595s etc., located in different countries
• Symptoms:
– “…slow response times…”
• We did:
– Performance data collection set without iptrace
– Performance data collection set with iptrace
• Conclusion:
– The TCP ACKs are delayed; consider enabling tcp_nodelayack to avoid piggybacking the ACKs and avoid the fast timeout delay of up to 200ms (~100ms on average)
• Opening the iptrace in Wireshark >>>>
– The ACKs are flowing properly, except intermittently when some ACKs are delayed by several seconds (from application level)
Filter expression used: (ip.addr eq 10.2.50.132 and ip.addr eq 10.2.50.1) and (tcp.port eq 4295 and tcp.port eq 1521)
We notice that the "request" packets arrive at 10.2.50.1 and it responds to most of them immediately (usually in less than 1 millisecond). But every now and then, 10.2.50.1 takes a few seconds to respond to a few requests.
– For example, a request (packet # 14618) arrives, but the response goes out only after ~3.7s, which is why a delayed ACK to the request is sent after 150 milliseconds. So even if an ACK (packet # 15646) were sent immediately, without waiting for 148 milliseconds, it would not help the performance here, as in any case the response goes out only after ~3.7s.
Once the connection is established, the request/response usually happens fast (though we notice cases where the time delta between requests is around a second and at times the response takes 200 ms).
The application team is investigating why this behaviour is seen for almost all the connections.
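The filter expression used in this case follows a regular pattern: both IP addresses and both TCP ports of one conversation. A minimal Python sketch that generates such a display filter string is shown here; the function name conversation_filter is hypothetical, and the sketch only formats text in the same form as the filter above.

```python
import ipaddress


def conversation_filter(host_a, port_a, host_b, port_b):
    """Build a Wireshark display filter matching one TCP conversation."""
    # Validate the inputs early so a typo does not silently produce
    # a filter that matches nothing.
    ipaddress.ip_address(host_a)
    ipaddress.ip_address(host_b)
    for port in (port_a, port_b):
        if not 0 < int(port) < 65536:
            raise ValueError("invalid TCP port: {}".format(port))
    return ("(ip.addr eq {} and ip.addr eq {}) and "
            "(tcp.port eq {} and tcp.port eq {})").format(
                host_a, host_b, port_a, port_b)


if __name__ == "__main__":
    print(conversation_filter("10.2.50.132", 4295, "10.2.50.1", 1521))
```

The resulting string can be pasted into the Wireshark filter box or passed to tshark as a display filter.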
IP Trace Findings
• Only enable tcp_nodelayack if the actual network traffic workload requires it
• This sampled iptrace data indicates that tcp_nodelayack is not required for this case
– Response from DB (port 1531 - mapped as rap-listen) is almost immediate (in <= 20 milliseconds)
– By enabling tcp_nodelayack (tcp_nodelayack=1), the receiver sends an ACK to the sender after every received packet, which results in additional network packets being sent for each packet with data transferred and significantly increases the network load.
– If disabled (default), the ACK is sent with the response to the sender, or at the latest after a delay of up to 200ms (default).
– If multiple partitions have packet rates which put the load on the VIOS SEA above 100K packets/s, enabling tcp_nodelayack for all partitions will basically double the packet rate to ~200K packets/s.
  • If the traffic is bridged over SEAs between hosts, the network stack will have unnecessarily high traffic volume
Filter used: (ip.addr eq 172.30.1.60 and ip.addr eq 172.30.1.80) and (tcp.port eq 50838 and tcp.port eq 1531)
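The doubling of the packet rate can be written out as simple arithmetic. The following Python sketch is a deliberately rough model, not a measurement: it assumes a one-way data flow, one ACK per data packet when tcp_nodelayack is enabled, and ACKs fully piggybacked on responses (so no extra packets) when it is disabled. The function name is hypothetical.

```python
def estimated_packet_rate(data_pps, nodelayack_enabled):
    """Rough packets/s estimate for a flow carrying data_pps data packets/s."""
    # Assumption: with tcp_nodelayack=1 every data packet triggers its own ACK;
    # with the default, ACKs ride on response packets and add nothing.
    ack_pps = data_pps if nodelayack_enabled else 0
    return data_pps + ack_pps


if __name__ == "__main__":
    for enabled in (False, True):
        print(enabled, estimated_packet_rate(100_000, enabled))
```

Real traffic falls somewhere between the two extremes, since delayed ACKs that time out are also sent as separate packets.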
IP Trace Findings
• Only enable tcp_nodelayack if the actual network traffic workload requires it
• This sampled iptrace data indicates that tcp_nodelayack is not required for this case
– We notice that the "request" packets arrive at 10.2.50.1 and it responds to most of them immediately (usually in less than 1 millisecond).
– But every now and then, 10.2.50.1 takes a few seconds to respond to a few requests.
– For example, a request (packet # 14618) arrives, but the response goes out only after ~3.7s, which is why a delayed ACK to the request is sent after 150 milliseconds.
– So even if an ACK (packet # 15646) were sent immediately, without waiting for 148 milliseconds, it would not help the performance here, as in any case the response goes out only after ~3.7s.
– Once the connection is established, the request/response usually happens fast (though we notice cases where the time delta between requests is around a second and at times the response takes 200 ms)
Filter used: (ip.addr eq 10.2.50.132 and ip.addr eq 10.2.50.1) and (tcp.port eq 4295 and tcp.port eq 1521)
Identifying some common network issues
IP Address, Port and Protocol
While debugging TCP network issues, the first step is to identify the IP address and port so that the debugging effort can be streamlined.
iptrace, without any options, captures all the network traffic, so it is important to know how to identify the IP addresses and ports of interest in the trace.
The following example includes ping, telnet and ssh commands to different IPs. The next few slides explain how to examine the protocol using the Wireshark tool.
# startsrc -s iptrace -a "-a /tmp/iptrace_all"
[4587572]
0513-059 The iptrace Subsystem has been started. Subsystem PID is 4587572.
# ping -c 1 9.122.214.107
PING 9.122.214.107 (9.122.214.107): 56 data bytes
64 bytes from 9.122.214.107: icmp_seq=0 ttl=242 time=23 ms
--- 9.122.214.107 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 23/23/23 ms
# telnet 9.182.76.224
Trying...
Connected to 9.182.76.224.
Escape character is '^]'.
telnet (ctx7p02.in.ibm.com)
AIX Version 7
Copyright IBM Corporation, 1982, 2011.
login: Connection closed.
# ssh 9.122.214.47
root@9.122.214.47's password:
IP Address, Port and Protocol – Identify Ping
– The packets section shows several fields such as the source IP, destination IP, protocol etc.
– The Info section gives a brief description of the packet
– Highlighting the packet shows the header and data in the respective views
– The header view shows the complete protocol stack header from the Ethernet layer to the application, in this case ICMP (ping)
IP Address, Port and Protocol – Identify Telnet
– The TCP three-way handshake of the telnet session consists of [SYN], [SYN, ACK] and [ACK] packets. The 1st and the 3rd packets are sent by the source and the 2nd by the destination
– The source and destination IPs are shown in the IP header
– The protocol can be identified by observing the protocol header, TCP in this example
– The protocol header for TCP shows port number 23, which is used by the telnet application.
Identifying some common network issues
Round trip time
Round trip time (RTT) is the length of time it takes for a packet to be sent plus the length of time it takes for an acknowledgment of that packet to be received.
Round trip time can be measured at different levels or protocols. For example:
– ping and traceroute commands can be used to find the round trip time at the IP layer
– iptrace can be used to find the round trip time at the TCP layer and also at the application layer
– For a DNS lookup, the difference between the timestamps of the request and the response is the round trip time
Round trip time example: DNS server lookup
To determine the DNS lookup round trip time:
– Start iptrace with the IP of the DNS server as the host parameter (destination in our case)
– Perform a DNS lookup
– Stop the trace
An abnormally high DNS lookup time will slow down applications that use DNS lookups.
Use a DNS server that is as close as possible to the server to minimize DNS lookup time.
In the next two slides, two scenarios are illustrated for DNS lookup. One DNS server is in close proximity to the requesting server and the other is on a different continent. A roughly tenfold increase in the DNS lookup time is observed between the two scenarios (26 milliseconds versus 247 milliseconds).
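The timestamp subtraction can be sketched in Python as follows. The input format is an assumption made for illustration: a list of (timestamp in seconds, DNS transaction ID, is_response) tuples, as they might be exported from the packet list. The function name and the sample values are hypothetical, chosen to mirror the 26 millisecond scenario.

```python
def dns_round_trip_times(packets):
    """Match DNS requests to responses by transaction ID and return RTTs.

    packets: iterable of (timestamp_seconds, transaction_id, is_response).
    Returns a dict mapping transaction_id to round trip time in seconds.
    """
    pending = {}  # transaction_id -> timestamp of the first request seen
    rtts = {}
    for timestamp, txid, is_response in packets:
        if not is_response:
            # Keep the first request time so retries do not shrink the RTT.
            pending.setdefault(txid, timestamp)
        elif txid in pending:
            rtts[txid] = timestamp - pending.pop(txid)
    return rtts


if __name__ == "__main__":
    sample = [
        (0.000, 0x1A2B, False),  # query leaves the client
        (0.026, 0x1A2B, True),   # answer arrives 26 ms later
    ]
    for txid, rtt in dns_round_trip_times(sample).items():
        print(hex(txid), round(rtt * 1000, 1), "ms")
```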
Round trip time example: DNS server lookup
Scenario 1: DNS lookup time 26 milliseconds
# startsrc -s iptrace -a "-a -b -s 9.184.192.240 /tmp/iptrace_local_dns"
[4587574]
0513-059 The iptrace Subsystem has been started. Subsystem PID is 4587574.
# nslookup host.sample.com
Server: 9.184.192.240
Address: 9.184.192.240#53
Non-authoritative answer:
Name: host.sample.com
Address: 9.182.76.38
# stopsrc -s iptrace
0513-044 The iptrace Subsystem was requested to stop.
iptrace: unload success!
The time column shows the value in seconds
Round trip time example: DNS server lookup
Scenario 2: DNS lookup time 247 milliseconds
# startsrc -s iptrace -a "-a -b -s 9.3.36.243 /tmp/iptrace_remote_dns"
[4587576]
0513-059 The iptrace Subsystem has been started. Subsystem PID is 4587576.
# nslookup remote_host.sample.com 9.3.36.243
Server: 9.3.36.243
Address: 9.3.36.243#53
Name: remote_host.sample.com
Address: 9.3.36.37
# stopsrc -s iptrace
0513-044 The iptrace Subsystem was requested to stop.
iptrace: unload success!
The time column shows the value in seconds
Identifying some common network issues
TCP Retransmission
TCP starts a retransmission timer when each outbound segment is handed down to IP. If no acknowledgment has been received for the data in a given segment before the timer expires, then the segment is retransmitted.
There are some circumstances under which TCP will retransmit data prior to the retransmission timer expiring. The most common of these occurs due to a feature known as fast retransmit.
What scenarios can trigger TCP Retransmissions?
– Several packets are dropped due to a faulty network
– The destination server is not able to receive packets because it has crashed, or the TCP/IP subsystem has crashed or is not responding
– The destination server is not able to send ACK (acknowledgment) packets within the retransmission timeout, or the ACK packets are lost due to a faulty network
TCP Retransmission – example (1 of 3)
# startsrc -s iptrace -a "-a -b -s 9.182.76.224 /tmp/iptrace_retransmission"
[4587578]
0513-059 The iptrace Subsystem has been started. Subsystem PID is 4587578.
# ftp 9.182.76.224
Connected to 9.182.76.224.
220 ctx7p02.in.ibm.com FTP server (Version 4.2 Mon May 9 19:10:31 CDT 2011) ready.
Name (9.182.76.224:root): root
331 Password required for root.
Password:
230-Last unsuccessful login: Wed Mar 28 10:57:14 2012 on ssh from 9.124.39.210
230-Last login: Thu Mar 29 22:46:11 2012 on /dev/vty0
230 User root logged in.
ftp> put /tmp/testfile
421 Service not available, remote server has closed connection
ftp> bye
221 Goodbye.
# stopsrc -s iptrace
0513-044 The iptrace Subsystem was requested to stop.
iptrace: unload success!
TCP Retransmission – example (2 of 3)
The following steps are performed to observe TCP Retransmission
– Initiate an FTP session to a destination server
– Freeze the destination server (by putting the server into KDB)
{stglbs34}:/#kdb
WARNING: Version mismatch between unix file and command kdb
START END <name>
0000000000001000 0000000004070000 start+000FD8
F00000002FF47600 F00000002FFDF9C0 __ublock+000000
000000002FF22FF4 000000002FF22FF8 environ+000000
000000002FF22FF8 000000002FF22FFC errno+000000
F1000F0A00000000 F1000F0A10000000 pvproc+000000
F1000F0A10000000 F1000F0A18000000 pvthread+000000
read vscsi_scsi_ptrs OK, ptr = 0x41C03A0
(0)> exit
– On the source server, try to transfer a file
Observations
– FTP will try to transmit data
– Since the destination server is frozen and hence cannot respond, the source server starts sending TCP retransmission packets
The next slide shows the observations in the Wireshark tool for the trace captured in the previous slide
Note: The destination system in the above example is put in kdb to simulate the TCP retransmission. This should not be tried in a production environment
TCP Retransmission – example (3 of 3)
Examine the time difference between successive retransmissions: with rounding, the time difference between retransmissions increases exponentially. For example, for packets 17, 18 and 19 the time differences are 2, 4 and 8 seconds respectively, which means that the timeout value is doubled for each retransmission, with an upper limit of 64 seconds. This doubling is called exponential backoff.
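The doubling with a cap can be reproduced with a few lines of Python. This is only an illustration of the pattern described above; the starting value of 2 seconds and the 64 second cap are the figures quoted on this slide, and the function name is hypothetical.

```python
def backoff_schedule(initial_timeout, retransmissions, cap=64):
    """Return the wait before each retransmission, doubling up to a cap."""
    schedule = []
    timeout = initial_timeout
    for _ in range(retransmissions):
        schedule.append(min(timeout, cap))
        timeout *= 2  # exponential backoff: double after every attempt
    return schedule


if __name__ == "__main__":
    # Starts at 2 seconds, as observed for packets 17, 18 and 19.
    print(backoff_schedule(2, 7))
```

Comparing such a schedule with the time deltas in the trace is a quick way to confirm that a run of identical segments is timer-driven retransmission rather than application-level resending.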
Identifying some common network issues
TCP Connection Reset
A TCP Connection Reset (RST) indicates that the destination server is receiving the packet but there is no application bound to that port.
Since no application is bound to the port, the packet cannot be received and processed by any application on the server.
What scenarios can trigger a TCP Connection Reset (RST)?
– TCP Connection Resets (RST) can happen when an incorrect port number is used.
– TCP Connection Resets (RST) can also happen when the application that is bound to a particular port crashes or is shut down.
TCP Connection Reset example 1 (1 of 2)
# startsrc -s iptrace -a "-a -b -s 9.182.76.224 /tmp/iptrace_rst_telnet"
[13172892]
0513-059 The iptrace Subsystem has been started. Subsystem PID is 13172892.
# telnet 9.182.76.224 4567
Trying...
telnet: connect: Connection refused
# stopsrc -s iptrace
0513-044 The iptrace Subsystem was requested to stop.
iptrace: unload success!
The telnet daemon listens on port 23. Here we telnet to a random port on which the telnet daemon does not listen: in the above example, a telnet request to the IP address 9.182.76.224 is tried on port 4567.
The next slide shows the TCP RST packets in the Wireshark tool for the trace captured in the above example:
– The RST packets are coming from server 9.182.76.224
– The TCP header shows that the RST flag is set
TCP Connection Reset example 1 (2 of 2)
The RST flag is set in the TCP header because we are trying to establish the telnet connection on port 4567, where no application is listening
TCP Connection Reset example 2 (1 of 3)
# startsrc -s iptrace -a "-a -b -s 9.182.76.224 /tmp/iptrace_rst_ssh"
[13172908]
0513-059 The iptrace Subsystem has been started. Subsystem PID is 13172908.
# ssh 9.182.76.224
root@9.182.76.224's password:
Last unsuccessful login: Wed Mar 28 10:57:14 2012 on ssh from 9.124.39.210
Last login: Thu Mar 29 23:04:08 2012 on ftp from ::ffff:9.182.76.223
*******************************************************************************
*                                                                             *
*                                                                             *
*  Welcome to AIX Version 7.1!                                                *
*                                                                             *
*                                                                             *
*  Please see the README file in /usr/lpp/bos for information pertinent to    *
*  this release of the AIX Operating System.                                  *
*                                                                             *
*                                                                             *
*******************************************************************************
#
# Connection to 9.182.76.224 closed by remote host.
Connection to 9.182.76.224 closed.
# stopsrc -s iptrace
0513-044 The iptrace Subsystem was requested to stop.
iptrace: unload success!
TCP Connection Reset example 2 (2 of 3)
The example is demonstrated using the following steps
– Start iptrace on the source
– ssh to a destination server and log in with the right credentials
– Using a different session (vterm or telnet but not ssh) to the destination server, kill the ssh daemon
– The ssh session to the destination will be terminated
Observations
– Because the ssh daemon is killed, there is no program (application) listening on the ssh port number 22
– The destination server (9.182.76.224 in this example) will send a TCP RST packet to the source server
The observations are confirmed by analyzing the iptrace file in the Wireshark tool. The next slide has the screenshots
TCP Connection Reset example 2 (3 of 3)
Identifying some common network issues
Duplicate Ack
[Figure labels: DESTINATION IP, SOURCE IP]
Issue: In the scenario shown, the ACK from the receiver is lost, so the sender retransmits the packet, and a "DUP ACK" appears in the trace.
Resolution: The network option "tcprexmtthresh" specifies how many consecutive duplicate ACKs are allowed before a TCP connection goes to the fast retransmit phase. By default the value of "tcprexmtthresh" is 3, so after three consecutive duplicate ACKs TCP goes to the fast retransmit phase. The value of "tcprexmtthresh" can be modified using the /usr/sbin/no command to change this behavior.
# no -a | grep tcprexmtthresh
tcprexmtthresh = 3
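The role of the duplicate ACK threshold can be illustrated with a simplified Python sketch. It only counts consecutive identical acknowledgment numbers and reports where the count reaches the threshold; a real TCP stack applies further conditions before treating an ACK as a duplicate, so this is a teaching model, not a description of the AIX implementation. The function name and the sample ACK numbers are hypothetical.

```python
def fast_retransmit_points(ack_numbers, threshold=3):
    """Return indexes at which the duplicate ACK count reaches the threshold.

    ack_numbers: acknowledgment numbers in the order the sender receives them.
    threshold:   plays the role of tcprexmtthresh (default 3).
    """
    triggers = []
    last_ack = None
    duplicates = 0
    for index, ack in enumerate(ack_numbers):
        if ack == last_ack:
            duplicates += 1
            if duplicates == threshold:
                triggers.append(index)  # fast retransmit would start here
        else:
            last_ack = ack      # new data acknowledged: reset the counter
            duplicates = 0
    return triggers


if __name__ == "__main__":
    acks = [1000, 2000, 2000, 2000, 2000, 3000]
    print(fast_retransmit_points(acks))
```

In this sample the ACK for 2000 is repeated three times after the original, so the threshold of 3 is reached on the fifth ACK.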
Identifying some common network issues
TCP large send offload
The TCP large send offload option allows the AIX TCP layer to build a TCP message up to 64 KB long and send it in one call down the stack through IP and the Ethernet device driver. The adapter then re-segments the message into multiple TCP frames to transmit on the wire. The TCP packets sent on the wire are either 1500 byte frames for a MTU of 1500 or up to 9000 byte frames for a MTU of 9000 (jumbo frames).
The TCP large send offload option reduces host processing and results in lower CPU utilization on the host CPU because segmentation of data is performed at the Ethernet layer. The savings vary depending on the average TCP large send size. For example, you can see a reduction of host CPU by 60 to 75% with the PCI-X GigE adapters with a MTU size of 1500. For jumbo frames, the savings are less because the system already sends larger frames. For example, you can see a reduction of host CPU by 40% with jumbo frames.
However, for best raw throughput, you should not enable this option because the data rate on the wire is slower with this option enabled.
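To get a feel for how much segmentation work the adapter takes over, the number of wire frames per large send can be estimated. The Python sketch below assumes IPv4 and TCP headers of 20 bytes each without other options, plus 12 bytes for the TCP timestamp option when RFC1323 is enabled; with a MTU of 1500 this gives the 1448 byte payload that the tshark filter tcp.len>1448 earlier in this deck is built around. Function names are hypothetical.

```python
import math

IP_HEADER_BYTES = 20         # IPv4 header without options
TCP_HEADER_BYTES = 20        # TCP header without options
TIMESTAMP_OPTION_BYTES = 12  # RFC1323 timestamp option including padding


def max_segment_size(mtu, timestamps=True):
    """Payload bytes that fit in one frame for the given MTU."""
    options = TIMESTAMP_OPTION_BYTES if timestamps else 0
    return mtu - IP_HEADER_BYTES - TCP_HEADER_BYTES - options


def wire_segments(message_bytes, mtu, timestamps=True):
    """Number of frames the adapter sends for one large send message."""
    return math.ceil(message_bytes / max_segment_size(mtu, timestamps))


if __name__ == "__main__":
    for mtu in (1500, 9000):
        print(mtu, max_segment_size(mtu), wire_segments(64 * 1024, mtu))
```

With a MTU of 1500 a single 64 KB message becomes several dozen frames, which is consistent with the larger CPU savings quoted for the standard MTU compared with jumbo frames.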
TCP large send offload
# ifconfig en0
en0: flags=1e080863,480<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),CHAIN>
inet 9.182.76.223 netmask 0xffffff00 broadcast 9.182.76.255
tcp_sendspace 262144 tcp_recvspace 262144 rfc1323 1
# ifconfig en0 largesend
# ifconfig en0
en0: flags=1e080863,4c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),LARGESEND,CHAIN>
inet 9.182.76.223 netmask 0xffffff00 broadcast 9.182.76.255
tcp_sendspace 262144 tcp_recvspace 262144 rfc1323 1
#
# lsattr -El en1 | grep largesend
largesend on Enable/Disable Largesend attribute True
#
The ifconfig command can be used to check if TCP large send is enabled on an Ethernet interface (adapter).
The ifconfig command can also be used to enable the TCP large send option for a particular Ethernet interface (adapter).
The TCP checksum offload option enables the network adapter to compute the TCP checksum on transmit and receive, which saves the AIX host CPU from having to compute the checksum. The savings for MTU 1500 are typically about a 5% reduction in CPU utilization, and for MTU 9000 (jumbo frames) approximately a 15% reduction in CPU utilization.
For large send to work, TCP checksum offload must be enabled; the reverse is not required.
TCP large send offload example (1 of 3) # startsrc -s iptrace -a "-a -b -s 9.182.76.224 /tmp/iptrace_largesend_dd" [13172912] 0513-059 The iptrace Subsystem has been started. Subsystem PID is 13172912. # ftp 9.182.76.224 Connected to 9.182.76.224. 220 ctx7p02.in.ibm.com FTP server (Version 4.2 Mon May 9 19:10:31 CDT 2011) ready. Name (9.182.76.224:root): root The dd command reads the InFile parameter or standard input, does the specified conversions, then copies the converted data to the OutFile parameter or standard output 331 Password required for root. Password: 230-Last unsuccessful login: Wed Mar 28 10:57:14 2012 on ssh from 9.124.39.210 230-Last login: Thu Mar 29 23:24:56 2012 on /dev/pts/0 from 9.182.76.223 230 User root logged in. ftp> put "|dd if=/dev/zero bs=32k count=1000" /dev/null 200 PORT command successful. 150 Opening data connection for /dev/null. 1000+0 records in 1000+0 records out 226 Transfer complete. 32768000 bytes sent in 0.2528 seconds (1.266e+05 Kbytes/s) local: |dd if=/dev/zero bs=32k count=1000 remote: /dev/null ftp> bye 221 Goodbye. # stopsrc -s iptrace 0513-044 The iptrace Subsystem was requested to stop. iptrace: unload success!
– The TCP large send offload option is enabled on the interface (see previous slide).
– An ftp session is started to the destination server.
– The dd command is used to generate network traffic.
TCP large send offload example (2 of 3)
The following steps can be used to demonstrate TCP large send offload:
– The TCP large send offload option is enabled on the interface (see previous slide).
– An ftp session is started to the destination server.
– The dd command is used to generate network traffic.
Observations
– TCP packets larger than 1500 bytes are sent by the source server.
– In this particular example, packets with sizes of 64K bytes and 32K bytes are sent.
The next slide shows a screenshot with large packets: 64K bytes and 32K bytes.
TCP large send offload example (3 of 3)
With large send enabled, the screenshot shows packets of 64K bytes and 32K bytes.
Identifying some common network issues
somaxconn
somaxconn is a "no" parameter that determines the maximum number of connections in a connection request queue.
– The default value for somaxconn is 1024.
# no -a | grep somaxconn
somaxconn = 1024
– Changing the somaxconn value using "no":
# no -o somaxconn=50
Setting somaxconn to 50
Change to tunable somaxconn, will only be effective for future connections
# no -a | grep somaxconn
somaxconn = 50
Note: The somaxconn value is changed here only to simulate a problem. In a production environment, tune the value according to the application's requirements.
listen Subroutine and somaxconn
Purpose
– Listens for socket connections and limits the backlog of incoming connections.
Syntax
– #include <sys/socket.h>
  int listen (Socket, Backlog)
  int Socket, Backlog;
Description
– The listen subroutine performs the following activities:
  1. Identifies the socket that receives the connections.
  2. Marks the socket as accepting connections.
  3. Limits the number of outstanding connection requests in the system queue.
The outstanding connection request queue length limit is specified by the Backlog parameter of each listen call. The no parameter somaxconn defines the maximum queue length limit allowed on the system, so the effective queue length limit is either Backlog or somaxconn, whichever is smaller.
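The "whichever is smaller" rule can be written down as a small sketch. The function name effective_backlog is hypothetical and only illustrates the arithmetic; the value 2048 is an invented application backlog, while 5, 50 and 1024 are values used elsewhere in this deck.

```python
def effective_backlog(backlog: int, somaxconn: int) -> int:
    """Return the queue length limit that applies to a listening socket.

    Illustrative helper (not an AIX API): the limit is the smaller of the
    Backlog passed to listen() and the system-wide somaxconn tunable.
    """
    return min(backlog, somaxconn)


# Backlog=5 with somaxconn=50: the application's backlog is the limit.
print(effective_backlog(5, 50))
# Hypothetical Backlog=2048 with the default somaxconn=1024: the tunable is the limit.
print(effective_backlog(2048, 1024))
```

In other words, raising somaxconn alone does not help an application that passes a small Backlog to listen, and vice versa.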
© Copyright IBM Corporation 2015
59
somaxconn and Performance impact (1 of 2)
[Figure: a client socket connecting to a server socket that listens on port 20005, with somaxconn=50 and Backlog=5; iptrace is run on both the client and the server.]
iptrace will show the SYN packet coming into the port that the server application is listening on, but no SYN/ACK going back to the client.
somaxconn and Performance impact (2 of 2)
# netstat -a | grep 20005
tcp4       0      0  ctx7p01.20005    ctx7p02.56935    ESTABLISHED
tcp4       0      0  ctx7p01.20005    ctx7p02.56936    ESTABLISHED
tcp4       0      0  ctx7p01.20005    ctx7p02.56937    ESTABLISHED
tcp4       0      0  ctx7p01.20005    ctx7p02.56938    ESTABLISHED
tcp4       0      0  ctx7p01.20005    ctx7p02.56939    ESTABLISHED
tcp4       0      0  ctx7p01.20005    ctx7p02.56940    ESTABLISHED
tcp4       0      0  ctx7p01.20005    ctx7p02.56941    ESTABLISHED
tcp4       0      0  ctx7p01.20005    ctx7p02.56942    ESTABLISHED
tcp4       0      0  ctx7p01.20005    ctx7p02.56943    ESTABLISHED
tcp4       0      0  *.20005          *.*              LISTEN
In this example, the netstat -s -p tcp output shows that 447 connection requests were discarded because the listener's queue was full. In such scenarios, work with the application team to set the somaxconn and backlog values so that this performance bottleneck is avoided.
# netstat -s -p tcp
tcp:
        8762 packets sent
                2836 data packets (471687 bytes)
                1 data packet (6 bytes) retransmitted
                3471 ack-only packets (1064 delayed)
                0 URG only packets
                0 window probe packets
                622 window update packets
                3664 control packets
                0 large sends
                0 bytes sent using largesend
                0 bytes is the biggest largesend
        10131 packets received
                4621 acks (for 473005 bytes)
                614 duplicate acks
                0 acks for unsent data
                4433 packets (500387 bytes) received in-sequence
                2 completely duplicate packets (6 bytes)
                0 old duplicate packets
                0 packets with some dup. data (0 bytes duped)
                601 out-of-order packets (0 bytes)
                0 packets (0 bytes) of data after window
                0 window probes
                597 window update packets
                0 packets received after close
                0 packets with bad hardware assisted checksum
                0 discarded for bad checksums
                0 discarded for bad header offset fields
                0 discarded because packet too short
                0 discarded by listeners
                447 discarded due to listener's queue full
                105 ack packet headers correctly predicted
                1949 data packet headers correctly predicted
        615 connection requests
        623 connection accepts
        1224 connections established (including accepts)
        1238 connections closed (including 15 drops)
-----MORE OUTPUT----------->
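When this counter has to be checked repeatedly, it can be picked out of the saved netstat output with a few lines of script. The following is a minimal sketch, assuming the output has been captured as text; the helper name listener_queue_drops is hypothetical, and the sample is an excerpt of the counters shown above.

```python
import re

# Excerpt of the "netstat -s -p tcp" output shown on this slide.
SAMPLE = """\
        0 discarded by listeners
        447 discarded due to listener's queue full
        615 connection requests
        623 connection accepts
"""


def listener_queue_drops(netstat_output: str) -> int:
    """Return the 'discarded due to listener's queue full' counter, or 0 if absent."""
    match = re.search(
        r"^\s*(\d+) discarded due to listener's queue full",
        netstat_output,
        re.MULTILINE,
    )
    return int(match.group(1)) if match else 0


print(listener_queue_drops(SAMPLE))
```

A value that keeps growing between two samples suggests that the listen queue (the smaller of Backlog and somaxconn) is too short for the incoming connection rate.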
TCP Enhancements: tcp_fastlo & tcp_fastlo_crosswpar (1 of 2)
"tcp_fastlo" is a network option that enables the TCP fast loopback feature. Using TCP fast loopback can significantly reduce TCP/IP protocol overhead and lower CPU utilization when the two TCP communication endpoints reside on the same logical partition (LPAR).
– To enable the fastpath for TCP loopback traffic, use the no command:
# no -o tcp_fastlo=1
This option is dynamic and takes effect for future TCP connections.
A second option, tcp_fastlo_crosswpar, enables TCP fastpath loopback to work between workload partitions (WPARs). The tcp_fastlo option must be enabled for the tcp_fastlo_crosswpar option to function.
– To enable the fastpath for TCP loopback traffic between workload partitions (WPARs), use the no command:
# no -o tcp_fastlo_crosswpar=1
Both parameters are available in AIX 7 Technology Level 1 and AIX 6 Technology Level 7.
TCP Enhancements: tcp_fastlo & tcp_fastlo_crosswpar (2 of 2)
With tcp_fastlo disabled:
# no -o tcp_fastlo=0
Setting tcp_fastlo to 0
# ftp localhost
Connected to loopback.
ftp> put "|dd if=/dev/zero bs=1k count=100000" /dev/null
226 Transfer complete.
102400000 bytes sent in 0.8873 seconds (1.127e+05 Kbytes/s)
ftp> put "|dd if=/dev/zero bs=64k count=100000" /dev/null
226 Transfer complete.
6553600000 bytes sent in 37.72 seconds (1.696e+05 Kbytes/s)
With tcp_fastlo enabled:
# no -o tcp_fastlo=1
Setting tcp_fastlo to 1
# ftp localhost
Connected to loopback.
ftp> put "|dd if=/dev/zero bs=1k count=100000" /dev/null
226 Transfer complete.
102400000 bytes sent in 0.795 seconds (1.258e+05 Kbytes/s)
ftp> put "|dd if=/dev/zero bs=64k count=100000" /dev/null
226 Transfer complete.
6553600000 bytes sent in 34.63 seconds (1.848e+05 Kbytes/s)
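To put the ftp figures side by side, the sketch below recomputes the transfer rate and the relative time saved. It assumes that 1 Kbyte is 1024 bytes (which reproduces the reported 1.258e+05 Kbytes/s for the 0.795 second transfer) and that the shorter times belong to the tcp_fastlo=1 runs, which the slide layout suggests but does not state explicitly. The function names are hypothetical.

```python
def kbytes_per_second(total_bytes: int, seconds: float) -> float:
    """Transfer rate in Kbytes/s, assuming 1 Kbyte = 1024 bytes."""
    return total_bytes / 1024 / seconds


def percent_time_saved(slow_seconds: float, fast_seconds: float) -> float:
    """Reduction in elapsed time of the fast run relative to the slow run."""
    return (slow_seconds - fast_seconds) / slow_seconds * 100


# bs=1k transfer: 102400000 bytes in 0.8873 s versus 0.795 s
print(f"{kbytes_per_second(102400000, 0.795):.4g} Kbytes/s")
print(f"{percent_time_saved(0.8873, 0.795):.1f}% less time for the bs=1k transfer")

# bs=64k transfer: 6553600000 bytes in 37.72 s versus 34.63 s
print(f"{percent_time_saved(37.72, 34.63):.1f}% less time for the bs=64k transfer")
```

These are single runs on one system, so the percentages are only indicative of the kind of gain to expect.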
UDP – Size of Packet
Frame Length (size of the packet) = UDP header (8 bytes) + IP header (20 bytes) + Ethernet header (14 bytes) + data (512 bytes)
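The frame length formula can be evaluated directly. This is a minimal sketch using the header sizes quoted on the slide (an IPv4 header without options and an Ethernet header without the trailing frame check sequence); the name frame_length is hypothetical.

```python
# Header sizes as quoted on the slide.
HEADER_BYTES = {"ethernet": 14, "ip": 20, "udp": 8}


def frame_length(data_bytes: int) -> int:
    """Frame length as displayed for a UDP datagram carrying data_bytes of payload."""
    return sum(HEADER_BYTES.values()) + data_bytes


print(frame_length(512))
```

For the 512 bytes of data in this example the sum is 8 + 20 + 14 + 512 = 554 bytes.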
UDP socket overflow (1 of 3)
The UDP socket buffer is one of the places where a server drops packets. These dropped packets are counted by the UDP layer, and you can see the statistics by using the netstat -p udp command.
[Figure: a client socket (port 36109, buffer length 5536) sending UDP datagrams to a server socket (port 9930, buffer length 65536).]
Socket buffer overflows can be caused by insufficient transmit and receive UDP sockets, too few nfsd daemons, or values that are too small for nfs_socketsize (in the case of NFS), udp_recvspace, and sb_max. If the netstat -p udp command indicates socket overflows, first check the affected system for CPU or I/O saturation, and verify the recommended settings for the other communication layers by using the no -a command. If the system is saturated, you must either reduce its load or increase its resources.
UDP socket overflow (2 of 3)
# no -a | grep udp
udp_bad_port_limit = 0
udp_ephemeral_high = 65535
udp_ephemeral_low = 32768
udp_inpcb_hashtab_siz = 24499
udp_pmtu_discover = 1
udp_recvspace = 42080
udp_sendspace = 9216
udp_ttl = 30
udpcksum = 1

Iteration 1
# netstat -Zs
# netstat -s -p udp
udp:
        2 datagrams received
        0 incomplete headers
        0 bad data length fields
        0 bad checksums
        0 dropped due to no socket
        2 broadcast/multicast datagrams dropped due to no socket
        0 socket buffer overflows
        0 delivered
        0 datagrams output

Iteration 2
# netstat -s -p udp
udp:
        1020 datagrams received
        0 incomplete headers
        0 bad data length fields
        0 bad checksums
        761 dropped due to no socket
        4 broadcast/multicast datagrams dropped due to no socket
        222 socket buffer overflows
        33 delivered
        10 datagrams output

Iteration 3
# netstat -s -p udp
udp:
        4097 datagrams received
        0 incomplete headers
        0 bad data length fields
        0 bad checksums
        3704 dropped due to no socket
        8 broadcast/multicast datagrams dropped due to no socket
        245 socket buffer overflows
        140 delivered
        33 datagrams output

Iteration 4
# netstat -s -p udp
udp:
        5101 datagrams received
        0 incomplete headers
        0 bad data length fields
        0 bad checksums
        4669 dropped due to no socket
        10 broadcast/multicast datagrams dropped due to no socket
        263 socket buffer overflows
        159 delivered
        35 datagrams output
UDP socket overflow (3 of 3)
# no -o udp_recvspace=262144
Setting udp_recvspace to 262144
Change to tunable udp_recvspace, will only be effective for future connections
# no -a | grep udp
udp_bad_port_limit = 0
udp_ephemeral_high = 65535
udp_ephemeral_low = 32768
udp_inpcb_hashtab_siz = 24499
udp_pmtu_discover = 1
udp_recvspace = 262144
udp_sendspace = 9216
udp_ttl = 30
udpcksum = 1

Iteration 1
# netstat -Zs
# netstat -s -p udp
udp:
        0 datagrams received
        0 incomplete headers
        0 bad data length fields
        0 bad checksums
        0 dropped due to no socket
        0 broadcast/multicast datagrams dropped due to no socket
        0 socket buffer overflows
        0 delivered
        0 datagrams output

Iteration 2
# netstat -Zs
# netstat -s -p udp
udp:
        1042 datagrams received
        0 incomplete headers
        0 bad data length fields
        0 bad checksums
        941 dropped due to no socket
        2 broadcast/multicast datagrams dropped due to no socket
        0 socket buffer overflows
        99 delivered
        10 datagrams output

Iteration 3
# netstat -s -p udp
udp:
        3083 datagrams received
        0 incomplete headers
        0 bad data length fields
        0 bad checksums
        2852 dropped due to no socket
        6 broadcast/multicast datagrams dropped due to no socket
        0 socket buffer overflows
        225 delivered
        23 datagrams output

Iteration 4
# netstat -s -p udp
udp:
        5314 datagrams received
        0 incomplete headers
        0 bad data length fields
        0 bad checksums
        3811 dropped due to no socket
        102 broadcast/multicast datagrams dropped due to no socket
        0 socket buffer overflows
        1401 delivered
        374 datagrams output
Set the udp_recvspace tunable high, because multiple UDP datagrams might arrive and wait on a socket for the application to read them. Also, many UDP applications use one particular socket to receive packets from all clients talking to the server application. Therefore, the receive space needs to be large enough to handle a burst of datagrams that might arrive from multiple clients and be queued on the socket, waiting to be read. If this value is too low, incoming packets are discarded and the sender has to retransmit them, which might cause poor performance.
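The effect of the larger udp_recvspace can be summarized from the cumulative counters of the two test series. This is a minimal sketch that takes the "datagrams received" and "socket buffer overflows" values of the four iterations from the netstat -s -p udp output of each series; the function name overflow_percent is hypothetical.

```python
# (datagrams received, socket buffer overflows) per iteration, cumulative counters.
BEFORE = [(2, 0), (1020, 222), (4097, 245), (5101, 263)]   # udp_recvspace = 42080
AFTER = [(0, 0), (1042, 0), (3083, 0), (5314, 0)]          # udp_recvspace = 262144


def overflow_percent(samples: list[tuple[int, int]]) -> float:
    """Share of received datagrams lost to socket buffer overflows at the last sample."""
    received, overflows = samples[-1]
    return 100.0 * overflows / received if received else 0.0


print(f"before tuning: {overflow_percent(BEFORE):.1f}% of received datagrams overflowed")
print(f"after tuning:  {overflow_percent(AFTER):.1f}% of received datagrams overflowed")
```

Because the counters are cumulative, the difference between two consecutive iterations shows how many overflows occurred in that interval.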
Embryonic connection
The term half-open connection is most often used to describe an embryonic connection, i.e. a TCP connection that is in the process of being established.
– Example:
  1. The originating endpoint (A) sends a SYN packet to the destination (B). A is now in an embryonic state (specifically, SYN_SENT) and is awaiting a response.
  2. B now updates its kernel information to indicate the incoming connection from A, and sends out a request to open a channel back (the SYN/ACK packet). At this point, B is also in an embryonic state (specifically, SYN_RCVD). Note that B was put into this state by another machine, outside of B's control.
The sender is in an embryonic state because it did not get any response from the receiver to its SYN packet.
IP Fragmentation
[Figure: a client socket sending 5K of data over UDP to a server socket; with an MTU of 1500, the data is carried in four fragments.]
Data = 5K
– payload: 0-1479 (1480 bytes)
– payload: 1480-2959 (1480 bytes)
– payload: 2960-4439 (1480 bytes)
– payload: 4440-5127 (688 bytes)
The MTU is the maximum packet size (including all headers) that can be transmitted on a network. If two hosts are communicating across a path of different networks, a transmitted packet becomes fragmented if its size is greater than the smallest MTU of any network in the path.
For local networks, if the data is larger than the maximum transmission unit (MTU), TCP breaks it into appropriately sized segments. UDP leaves the fragmentation to the IP layer. The interface (IF) layer makes sure that no packet exceeds the MTU.
The receiving host places the incoming packets on the adapter's receive queue. They are then passed up to the IP layer, which determines whether any fragmentation has occurred because of the MTU. If so, it reassembles the fragments into their original form and passes the packets to TCP or UDP.
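The fragment boundaries in this example can be reproduced with a short calculation. The sketch below assumes that "5K" means 5120 bytes of application data plus the 8-byte UDP header (5128 bytes of IP payload, which matches the last byte offset 5127) and a 20-byte IPv4 header without options. The function name fragment_ranges is hypothetical.

```python
IP_HEADER = 20   # IPv4 header without options
UDP_HEADER = 8


def fragment_ranges(ip_payload_bytes: int, mtu: int = 1500) -> list[tuple[int, int, int]]:
    """Return (first byte, last byte, size) for each IP fragment of the payload."""
    # Fragment offsets are expressed in 8-byte units, so every fragment except
    # the last carries a multiple of 8 bytes.
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    ranges = []
    start = 0
    while start < ip_payload_bytes:
        end = min(start + per_fragment, ip_payload_bytes) - 1
        ranges.append((start, end, end - start + 1))
        start = end + 1
    return ranges


for first, last, size in fragment_ranges(5 * 1024 + UDP_HEADER):
    print(f"payload: {first}-{last} ({size} bytes)")
```

With an MTU of 1500, each full fragment carries 1480 bytes of IP payload, and the remaining 688 bytes travel in the fourth fragment.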
TCP Delays: Nagle Algorithm (1 of 2)
TCP request and response workloads are workloads that involve a two-way exchange of information. Examples are Remote Procedure Call (RPC) type applications such as NFS file systems, client/server applications such as web browser requests to a web server, and other communication protocols such as telnet and ssh. Many of these requests or responses use small messages.
Nagle's algorithm addresses the "small packet problem", in which an application repeatedly emits data in small chunks, frequently only a few bytes in size. Since TCP packets have a 40-byte header (20 bytes for TCP, 20 bytes for IPv4), this results in, say, a 41-byte packet for 1 byte of useful information, which is a huge overhead. This situation often occurs in Telnet sessions, where most key presses generate a single byte of data that is transmitted immediately.
Nagle's algorithm works by combining a number of small outgoing messages and sending them all at once. Specifically, as long as there is a sent packet for which the sender has received no acknowledgment, the sender keeps buffering its output until it has a full packet's worth of output, so that the output can be sent all at once.
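The header overhead behind the small packet problem is easy to quantify. This is a minimal sketch using the 20-byte TCP and 20-byte IPv4 header sizes quoted above; the 256-byte and 1460-byte payloads are illustrative values chosen for comparison, and the function name is hypothetical.

```python
TCP_HEADER = 20
IPV4_HEADER = 20


def header_overhead_percent(payload_bytes: int) -> float:
    """Share of a TCP/IPv4 packet that is header rather than payload."""
    headers = TCP_HEADER + IPV4_HEADER
    return 100.0 * headers / (headers + payload_bytes)


for payload in (1, 256, 1460):
    print(f"{payload:5d} payload bytes -> {header_overhead_percent(payload):.1f}% header overhead")
```

A single keystroke sent on its own is almost entirely header, while a payload that fills a 1500-byte packet keeps the overhead below 3%.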
TCP Delays: Nagle Algorithm (2 of 2)
TCP Nagle algorithm tunables:
– tcp_nagle_limit: TCP disables the Nagle algorithm for segments equal to or larger than this value, so the threshold at which Nagle is enabled can be tuned. For example, to disable Nagle completely, set tcp_nagle_limit to 1. To allow TCP to bundle up sends and send packets that are at least 256 bytes, set tcp_nagle_limit to 256. tcp_nagle_limit can be set using the no command.
– tcp_nodelay: The Nagle algorithm can be disabled at the interface level by setting the interface parameter tcp_nodelay. This can be set using the ifconfig command.
– TCP_NODELAY: To disable the Nagle algorithm at the socket level, pass the TCP_NODELAY option to the setsockopt subroutine.
Note: An application can set TCP_NODELAY (disabling Nagle) through socket options. This overrides the system configuration, even when interface-specific network options (ISNO) are set.
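At the socket level, the setsockopt call looks much the same in most languages. The following is a minimal sketch in Python rather than C, using the standard socket module; the helper name open_nodelay_socket is hypothetical and the socket is never connected.

```python
import socket


def open_nodelay_socket() -> socket.socket:
    """Create a TCP socket with the Nagle algorithm disabled via TCP_NODELAY."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # A non-zero value turns TCP_NODELAY on, so small writes are sent immediately.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


sock = open_nodelay_socket()
# Read the option back to confirm that it is set.
print(sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0)
sock.close()
```

Setting the option per socket limits the change to the application that needs it, instead of changing the behavior for a whole interface or system.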
TCP Delays: Acknowledgement (Delayed ACK)
– Piggybacking is a technique that makes data transmission more efficient by avoiding individual acknowledgement (ACK) packets.
– Instead of being sent in an individual frame, an acknowledgement is piggybacked on the next data frame.
– Therefore, with piggybacking, the acknowledgement has to wait until the data for the next frame is ready.
– On AIX, the default behavior for TCP connections results in delayed acknowledgements.
– By default, AIX waits up to 200 ms to piggyback an acknowledgement. If data for the next frame is not available within the 200 ms timeout, an individual acknowledgement frame is sent.
– tcp_nodelayack: the network option (no command) that can disable delayed acknowledgements.
– fasttimo: the network option that reduces the 200 ms timer, which is the default, to 100 or 50 ms. Reducing this timer adds overhead to the system because all the TCP connections have to be scanned more often. Use the fasttimo option only as a last resort when tuning a system.
TCP Delays: Nagle algorithm & Delayed Acknowledgement (1 of 3)
– A combination of the Nagle algorithm on the sender and delayed acknowledgement on the receiver can adversely affect application performance.
– Such a situation can arise when the sender is waiting for an acknowledgement from the receiver (for example, because the last chunk of data is smaller than the frame (MTU) size) AND the TCP layer of the receiver is delaying the acknowledgement, waiting for the 200 ms timeout because it has no data to send.
– Thus, in scenarios in which the request-response spans just a few packets, the last packet is affected by the delayed acknowledgement timeout, and the application can suffer a few hundred milliseconds of latency.
– In such situations, it is beneficial to disable delayed acknowledgement by setting tcp_nodelayack on the receiver.
– In general, since Nagle's algorithm is only a defense against careless applications, it will not benefit a carefully written application that takes proper care of buffering; the algorithm has either no effect or a negative effect on such an application.
TCP Delayed ACK
[Figure: packet exchange between a TCP sender and a TCP receiver.]
In this figure, TCP delays the transmission of ACKs for packet 1, packet 2, and packet 3. The hope is to have data ready within that time frame, so that the ACK can be piggybacked on a data segment.
TCP Delayed ACK and fasttimo
[Figure: packet exchange between a TCP sender and a TCP receiver, with a 200 ms delay (the default fasttimo value) before the ACK is sent.]
The TCP layer of the receiver delays sending the acknowledgement, waiting for the 200 ms timeout because it has no data to send.
TCP Delays: Nagle algorithm with Delayed Ack Enabled
# no -a | grep tcp_nodelayack
tcp_nodelayack = 0
# startsrc -s iptrace -a "-a -b -s 9.182.76.224 /tmp/iptrace_delayack"
[5373992]
0513-059 The iptrace Subsystem has been started. Subsystem PID is 5373992.
# date; /tmp/tcp_samples/tcp_server 4567; date
Thu Mar 29 15:55:17 2012
Handling client 9.182.76.224
recv() failed: Connection reset by peer
Thu Mar 29 16:01:32 2012
# stopsrc -s iptrace
0513-044 The iptrace Subsystem was requested to stop.
iptrace: unload success!
The sender and receiver are exchanging very small packets. With the default setting tcp_nodelayack=0, delayed acknowledgements are enabled, so the receiver delays its acknowledgements. The trace shows that acknowledgements are delayed by nearly 200 ms.
TCP Delayed ACK Disabled
[Figure: packet exchange between a TCP sender and a TCP receiver.]
The receiver does not delay the ACK; it sends each acknowledgement as soon as the corresponding packet is received.
TCP Delays: Nagle algorithm with Delayed Ack Disabled
# no -o tcp_nodelayack=1
Setting tcp_nodelayack to 1
# no -a | grep tcp_nodelayack
tcp_nodelayack = 1
# startsrc -s iptrace -a "-a -b -s 9.182.76.224 /tmp/iptrace_nodelayack"
[5373996]
0513-059 The iptrace Subsystem has been started. Subsystem PID is 5373996.
# date; /tmp/tcp_samples/tcp_server 4567; date
Thu Mar 29 16:07:16 2012
Handling client 9.182.76.224
recv() failed: Connection reset by peer
Thu Mar 29 16:07:17 2012
# stopsrc -s iptrace
0513-044 The iptrace Subsystem was requested to stop.
iptrace: unload success!
The trace shows no delayed acknowledgements, in contrast to the earlier trace taken with delayed acknowledgements enabled, where ACKs were delayed by 200 ms. The data transfer also completes in a matter of seconds.
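The date stamps printed around the two tcp_server runs (tcp_nodelayack=0 and tcp_nodelayack=1) give a rough measure of the difference. The sketch below only subtracts the time stamps from the two command transcripts. Note that each pair brackets the whole server run, so it may include time spent waiting for the client to connect. The function name elapsed_seconds is hypothetical.

```python
from datetime import datetime

# Format of the AIX "date" output used in the transcripts, without a time zone.
DATE_FORMAT = "%a %b %d %H:%M:%S %Y"


def elapsed_seconds(start: str, end: str) -> float:
    """Seconds between two date stamps in DATE_FORMAT."""
    begin = datetime.strptime(start, DATE_FORMAT)
    finish = datetime.strptime(end, DATE_FORMAT)
    return (finish - begin).total_seconds()


delayed_ack = elapsed_seconds("Thu Mar 29 15:55:17 2012", "Thu Mar 29 16:01:32 2012")
no_delayed_ack = elapsed_seconds("Thu Mar 29 16:07:16 2012", "Thu Mar 29 16:07:17 2012")

print(f"tcp_nodelayack=0: {delayed_ack:.0f} s")
print(f"tcp_nodelayack=1: {no_delayed_ack:.0f} s")
```

The run with delayed acknowledgements took minutes, while the run with tcp_nodelayack=1 finished within about a second.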
ARP: Address Resolution Protocol
– The Address Resolution Protocol is used to discover the mapping between a layer 3 (IP) address and a layer 2 (host adapter or MAC) address.
– ARP is handled transparently by AIX, which maintains an ARP cache.
– The ARP cache table is composed of a number of buckets (default 149). Each bucket holds a number of entries (default 7). Therefore, by default, the ARP cache can hold 1043 (149 x 7) host addresses.
– The arp command can be used to view and modify the ARP cache:
# arp -a
  ? (9.182.76.143) at e4:1f:13:b9:66:40 [ethernet] stored in bucket 16
  host.sample.com (9.182.76.145) at e4:1f:13:b9:6d:4c [ethernet] stored in bucket 18
  ? (9.182.76.147) at 0:5:33:6a:db:56 [ethernet] stored in bucket 20
  ? (9.182.76.1) at 0:0:c:7:ac:cc [ethernet] stored in bucket 23
bucket: 0 contains: 0 entries
bucket: 1 contains: 0 entries
# arp -d 9.182.76.224
9.182.76.224 (9.182.76.224) deleted
# arp -a | grep 9.182.76.224
# ping -c 1 9.182.76.224
PING 9.182.76.224 (9.182.76.224): 56 data bytes
64 bytes from 9.182.76.224: icmp_seq=0 ttl=255 time=0 ms
--- 9.182.76.224 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 0/0/0 ms
# arp -a | grep 9.182.76.224
  ? (9.182.76.224) at 36:f9:d7:5:78:3 [ethernet] stored in bucket 97
ARP packets in Wireshark
# arp -a | grep 9.182.76.224
# startsrc -s iptrace -a "-b -s 9.182.76.224 /tmp/iptrace_arp"
[11534492]
0513-059 The iptrace Subsystem has been started. Subsystem PID is 11534492.
# ping -c 1 9.182.76.224
PING 9.182.76.224 (9.182.76.224): 56 data bytes
64 bytes from 9.182.76.224: icmp_seq=0 ttl=255 time=0 ms
--- 9.182.76.224 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 0/0/0 ms
# stopsrc -s iptrace
0513-044 The iptrace Subsystem was requested to stop.
The Wireshark screenshot shows an ARP request before the ping (ICMP) request. The Info column shows the ARP request and the response. The response contains the IP address (layer 3) to MAC address (layer 2) mapping.
ARP issues and tunables (1 of 2)
The no command configures network tuning parameters. Its ARP-related tunable parameters are:
– arpqsize (default 12): Determines how many packets can be queued by the ARP layer until an ARP response is received back from an ARP request.
– arpt_killc (default 20 minutes): The time, in minutes, before an ARP entry is deleted. Entries are aged out of the ARP table after this interval because a MAC address can change, for example when a server's network adapter is replaced.
– arptab_bsiz (default 7): Determines the number of entries in each ARP bucket.
– arptab_nb (default 149): Determines the number of ARP buckets.
arptab_bsiz * arptab_nb determines the size of the ARP cache table.
The arpqsize, arptab_bsiz, and arptab_nb parameters require a reboot to take effect.
The arpt_killc parameter is dynamic, so it can be changed without rebooting the system.
ARP issues and tunables (2 of 2)
– By default, the ARP cache table has 149 buckets with 7 entries each, so the table can hold 1043 (149 x 7) host address mappings.
– This default setting works for a machine that communicates with up to 1043 other machines concurrently on the IP network. If a server connects to more than 1043 machines on the network concurrently, the ARP table is too small, which causes the ARP table to thrash and results in poor performance. AIX must then purge an entry in the cache and replace it with a new address.
– When ARP thrashing occurs, the TCP or UDP packets have to be queued while the ARP protocol exchanges this information. The arpqsize parameter determines how many of these waiting packets can be queued by the ARP layer until an ARP response is received back from an ARP request. If the ARP queue is overrun, outgoing TCP or UDP packets are dropped.
ARP cache thrashing might have a negative impact on performance for the following reasons:
– The current outgoing packet has to wait for the ARP protocol lookup over the network.
– Another ARP entry must be removed from the ARP cache. If all of the addresses are in use, the deleted host address has to be resolved again as soon as packets are sent to it.
– The ARP output queue might be overrun, which could cause dropped packets.
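The capacity of the ARP cache follows directly from the two tunables. This is a minimal sketch of the arithmetic; the function names are hypothetical, and the 1500 concurrent peers and the bucket size of 11 are invented values used only to illustrate a table that is too small and one possible way to enlarge it.

```python
def arp_cache_capacity(arptab_nb: int = 149, arptab_bsiz: int = 7) -> int:
    """Number of host address mappings the ARP cache table can hold."""
    return arptab_nb * arptab_bsiz


def cache_is_large_enough(concurrent_peers: int, arptab_nb: int = 149, arptab_bsiz: int = 7) -> bool:
    """True if every concurrently used peer could have its own ARP entry."""
    return concurrent_peers <= arp_cache_capacity(arptab_nb, arptab_bsiz)


print(arp_cache_capacity())                          # default table size
print(cache_is_large_enough(1500))                   # hypothetical: 1500 peers, default table
print(cache_is_large_enough(1500, arptab_bsiz=11))   # hypothetical: larger buckets
```

Since entries are placed in buckets, an individual bucket can fill up before the whole table does, so the total capacity is an upper bound rather than a guarantee.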
Thank you – Tack !
Björn Rodén
roden@ae.ibm.com
http://www.linkedin.com/in/roden