INTERNATIONAL CONFERENCE ON CURRENT INNOVATIONS IN ENGINEERING AND TECHNOLOGY
ISBN: 378 - 26 - 138420 - 5
AN EFFECTIVE APPROACH TO ELIMINATE TCP INCAST COLLAPSE IN DATACENTER ENVIRONMENTS

S Anil Kumar (1), G Rajesh (2)

(1) M.Tech 2nd year, Dept. of CSE, Audisankara College of Engineering & Technology, Gudur, A.P, India.
(2) Asst Professor, Dept. of CSE, Audisankara College of Engineering & Technology, Gudur, A.P, India.
Abstract – Transport Control Protocol (TCP) many-to-one communication congestion happens in high-bandwidth and low-latency networks when two or more synchronized servers send data to the same receiver in parallel. For many key data-center applications such as MapReduce and Search, this many-to-one traffic pattern is common. Hence, TCP many-to-one communication congestion may severely degrade their performance, e.g., by increasing response time. In this paper, we explore many-to-one communication by focusing on the relationships between TCP throughput, round-trip time (RTT), and receive window. Unlike previous approaches, which moderate the impact of TCP incast congestion by using a fine-grained timeout value, our plan is to design an Incast congestion Control for TCP (ICTCP) scheme on the receiver side. In particular, this method changes the TCP receive window proactively before packet loss occurs.

Index terms – congestion control, many-to-one communication, round trip time, TCP throughput.

I. INTRODUCTION

Transport Control Protocol (TCP) is widely used on the Internet and normally works fine. However, recent works have shown that TCP does not work well for many-to-one traffic patterns on high-bandwidth, low-latency networks. Congestion occurs when many synchronized servers under the same Gigabit Ethernet switch simultaneously send data to one receiver in parallel. Only after all connections have finished the data transmission can the next round be issued. Thus, these connections are also called barrier-synchronized. The final performance is determined by the slowest TCP connection, which may suffer from timeout due to packet loss. The performance collapse of these many-to-one TCP connections is called TCP incast congestion. Data-center networks are well structured and layered to achieve high bandwidth and low latency, and the buffer size of top-of-rack (ToR) Ethernet switches is usually small, as shown in Fig. 1.
Fig. 1 Data-center network of a ToR switch connected to multiple rack-mounted servers.

A recent measurement study showed that a barrier-synchronized many-to-one traffic pattern is common in data-center networks, mainly caused by MapReduce and similar applications in data centers; this incast traffic pattern is shown in Fig. 2.

Fig. 2 Incast congestion in a data-center application.

The root cause of TCP incast collapse is that the highly bursty traffic of multiple TCP connections overflows the Ethernet switch buffer in a short period of time, causing intense packet loss and thus TCP retransmissions and timeouts. Prior solutions focused on either reducing the response time for packet loss recovery with quicker retransmissions, or controlling switch buffer occupation to avoid overflow by using ECN and modified TCP on both the sender and receiver sides. This paper focuses on avoiding packet loss before incast congestion occurs, which is more appealing than recovery after loss; moreover, recovery schemes can be complementary to congestion avoidance. Our idea is to perform incast congestion avoidance at the receiver side. The receiver side can adjust the receive window size of each TCP connection, so that the aggregate burstiness of all the synchronized senders is kept under control. We call our design Incast congestion Control for TCP (ICTCP). We first perform congestion avoidance at the system level. We then use the per-flow state to finely tune the receive window of each connection on the receiver side. The technical novelties of this work are as follows:

1) To perform congestion control on the receiver side, we use the available bandwidth on the network interface as a quota to coordinate the receive window increase of all incoming connections.

2) Our per-flow congestion control is performed independently, in slotted time of the round-trip time (RTT) of each connection, which is also the control latency in its feedback loop.
3) Our receive window adjustment is based on the ratio of the difference between the measured and expected throughput over the expected throughput. This allows us to estimate the throughput requirement from the sender side and adapt the receive window accordingly. We also find that live RTT is necessary for throughput estimation, as we have observed that TCP RTT in a high-bandwidth, low-latency network increases with throughput, even if link capacity is not reached. We have developed and implemented ICTCP as a Windows Network Driver Interface Specification (NDIS) filter driver.

i) TCP Incast Congestion

Incast congestion happens when multiple sending servers under the same ToR switch send data to one receiver server simultaneously, as shown in Fig. 3.

Fig. 3 Scenario of incast congestion in data-center networks.

The amount of data transmitted by each connection is relatively small. In Fig. 4, we show the goodput achieved on multiple connections versus the number of sending servers.

Fig. 4 Total goodput of multiple barrier-synchronized TCP connections versus the number of senders, where the data traffic volume per sender is a fixed amount.

We first establish multiple TCP connections between all senders and the receiver, respectively. Then, the receiver sends out a (very small) request packet to ask each sender to transmit data, respectively. The TCP connections are issued round by round, and one round ends when all connections in that round have finished their data transfer to the receiver. We observe similar goodput trends for three different traffic amounts per server, but with slightly different transition points. TCP throughput is severely degraded by incast congestion, since one or more TCP connections can experience timeouts caused by packet drops. TCP variants sometimes improve performance, but cannot prevent incast congestion collapse, since most of the timeouts are caused by full-window losses due to Ethernet switch buffer overflow. The TCP incast scenario is common for data-center applications.
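To make the round structure concrete, the following Python sketch issues one barrier-synchronized round under our own assumptions: the host addresses, port, and block size are hypothetical (the text above does not fix them), and connections are re-established per round rather than kept open as in the measurement setup.

    import socket
    from concurrent.futures import ThreadPoolExecutor

    # Hypothetical parameters; the paper does not give concrete values.
    SENDERS = [("10.0.0.%d" % i, 5001) for i in range(2, 42)]  # ~40 servers under one ToR
    BLOCK_SIZE = 64 * 1024  # fixed traffic volume per sender and per round (assumed)

    def fetch_block(addr):
        """Ask one sender for its data block and read it completely."""
        with socket.create_connection(addr) as s:
            s.sendall(b"GET")                 # the (very small) request packet
            remaining = BLOCK_SIZE
            while remaining > 0:
                chunk = s.recv(min(65536, remaining))
                if not chunk:
                    raise ConnectionError("sender closed early")
                remaining -= len(chunk)

    def run_round():
        # Barrier synchronization: the round ends only when the slowest
        # connection has finished, so a single timeout stalls the whole round.
        with ThreadPoolExecutor(max_workers=len(SENDERS)) as pool:
            for f in [pool.submit(fetch_block, a) for a in SENDERS]:
                f.result()  # wait for every sender before issuing the next round

    for round_no in range(10):
        run_round()

The barrier in run_round is exactly what makes incast painful: the goodput of the whole round is gated by the one connection that hits a retransmission timeout.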
ii) Reasons for TCP Incast Congestion

We focus on the typical incast scenario where dozens of servers are connected by a Gigabit Ethernet switch. In this scenario, the congestion point happens right before the receiver. A recent measurement study showed that this scenario exists in data-center networks, and the traffic between servers under the same ToR switch is actually one of the most significant traffic patterns in data centers, as locality has been considered in job distribution.

Incast congestion happens when the switch buffer overflows because the network pipe is not large enough to contain all the TCP packets injected into the network. The ToR switch is usually low-end (compared to higher-layer ones), and thus its queue size is not very large. To constrain the number of packets in flight, TCP has two windows: a congestion window on the sender side and a receive window on the receiver side. This paper chooses TCP receive window adjustment as its solution space. If the TCP receive window sizes are properly controlled, the total receive window size of all connections should be no greater than the base bandwidth-delay product (BDP) plus the queue size.
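For a rough sense of scale, with assumed numbers that the text does not specify: on a 1-Gbit/s last hop with a 100-µs base RTT, the base BDP is 10^9 bit/s × 100 µs = 100 kbit ≈ 12.5 kB; if the ToR port has, say, a 100-kB output queue, the receive windows of all connections sharing that port should sum to no more than about 112.5 kB to avoid overflow.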
We observe that the TCP receive window can be used to throttle TCP throughput, so it can be leveraged to handle incast congestion even though the receive window was originally designed for flow control. The benefit of an incast congestion control scheme at the receiver side is that the receiver knows how much throughput it has achieved and how much available bandwidth remains. The difficulty at the receiver side is that an overly throttled window may constrain TCP performance, while an oversized window may not prevent incast congestion. Moreover, the TCP connections arriving at a receiver could be incast or not, and the coexistence of incast and non-incast connections must be handled.

Previous work focused on how to mitigate the impact of timeouts, which are caused by a large amount of packet loss under incast congestion. Given such high bandwidth and low latency, we focus instead on how to perform congestion avoidance to prevent switch buffer overflow. Avoiding unnecessary buffer overflow significantly reduces TCP timeouts and saves unnecessary retransmissions.

II. SYSTEM DESIGN

Our goal is to improve TCP performance for incast congestion without introducing a new transport-layer protocol. Our transport-layer solution keeps backward compatibility on the protocol and programming interface, and makes our scheme general enough to handle incast congestion in future high-bandwidth and low-latency networks.

As the base RTT is hundreds of microseconds in data centers, our algorithm is restricted to adjusting the receive window only for TCP flows with an RTT of less than 2 ms; this constraint is designed to focus on low-latency flows. Based upon the following observations, our receive-window-based incast congestion control is intended to set a proper receive window for all TCP connections sharing the same last hop. Considering that there are many
TCP connections sharing the bottlenecked last hop before incast congestion, we adjust the TCP receive window to make those connections share the bandwidth equally. This is because, in a data center, parallel TCP connections may belong to the same job, where the last one finished determines the final performance.

III. ICTCP ALGORITHM

ICTCP provides a receive-window-based congestion control algorithm for TCP at the end-system. The receive windows of all low-RTT TCP connections are jointly adjusted to control throughput under incast congestion. The ICTCP algorithm closely follows the design points made above. We describe how to set the receive window of a TCP connection.

i) Bandwidth Estimation

We use the available bandwidth as the basis for increasing the receive windows of all incoming connections on the receiver server. We develop ICTCP as an NDIS driver on the Windows OS. Our NDIS driver intercepts TCP packets and modifies the receive window size if needed. It is assumed there is one network interface on the receiver server, and the symbols defined below correspond to that interface. The algorithm can also be applied to a scenario where the receiver has multiple interfaces; the connections on each interface should then perform this algorithm independently. Assume the link capacity of the interface on the receiver server is L. Define the bandwidth of the total incoming traffic observed on that interface as BWT, which includes all types of packets, i.e., broadcast, multicast, and unicast of UDP or TCP, etc. Then, define the available bandwidth BWA on that interface, used to increase the receive windows of all incoming connections, as

    BWA = max(0, β * L − BWT)

where β ∈ [0, 1] is a parameter to absorb potential oversubscribed bandwidth during window adjustment. A larger β indicates the need to more conservatively constrain the receive window and places higher requirements on the switch buffer to avoid overflow; a lower β indicates the need to more aggressively constrain the receive window, but throughput could then be unnecessarily throttled. We use a fixed setting of β in ICTCP. The available bandwidth BWA serves as the quota for all incoming connections to increase their receive windows for higher throughput. Each flow should estimate its potential throughput increase before its receive window is increased. Only when there is enough quota (BWA) can the receive window be increased, and the corresponding quota is consumed to prevent bandwidth oversubscription. To estimate the available bandwidth on the interface and provide a quota for later receive window increases, we divide time into slots. Each slot consists of two subslots of the same length. For each network interface, we measure all the traffic received in the first subslot and use it to calculate the available bandwidth as a quota for window increases in the second subslot. The receive window of a TCP connection is never increased in the first subslot, but it may be decreased when congestion is detected or the receive window is identified as being over-satisfied.
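The estimation loop above can be summarized in a short sketch. The following Python fragment is a minimal illustration under our own assumptions: the link capacity, β value, and subslot length are hypothetical, and a real implementation (e.g., the NDIS driver) would hook actual interface counters rather than a callback.

    # Minimal sketch of ICTCP's slotted available-bandwidth estimation.
    LINK_CAPACITY = 1e9   # L: link capacity of the interface in bit/s (assumed 1 Gbit/s)
    BETA = 0.9            # beta in [0, 1]; 0.9 is an assumed setting, not from the text
    SUBSLOT = 0.002       # subslot length in seconds (assumed)

    bw_available = 0.0    # BWA: quota left for receive-window increases (bit/s)
    in_second_subslot = False

    def end_of_subslot(bits_received, first_subslot):
        """Called at every subslot boundary with the total traffic seen on the interface."""
        global bw_available, in_second_subslot
        if first_subslot:
            bw_total = bits_received / SUBSLOT                         # BWT over this subslot
            bw_available = max(0.0, BETA * LINK_CAPACITY - bw_total)   # BWA = max(0, beta*L - BWT)
            in_second_subslot = True    # increases are allowed only in the next (second) subslot
        else:
            in_second_subslot = False   # back to measuring; windows may only decrease

    def try_window_increase(throughput_gain):
        """Grant a window increase only if enough quota remains; consume it if granted."""
        global bw_available
        if in_second_subslot and throughput_gain <= bw_available:
            bw_available -= throughput_gain   # consume quota to avoid oversubscription
            return True
        return False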
ii) Flow Stack Information

A flow table is the key data structure maintained in the receiver server. A flow is identified by a 5-tuple: source/destination IP address, source/destination port, and protocol. The flow table stores flow information for all the active flows. The packet header is parsed, and the corresponding information is updated in the flow table. Network Driver Interface Specification filter functionalities are performed here, such as collecting network statistics, monitoring activities, and filtering unauthorized traffic. In ICTCP, each connection adjusts its receive window only when an ACK is sent out on that connection. No additional pure TCP ACK packets are generated solely for receive window adjustment, so no traffic is wasted. For a TCP connection, after an ACK is sent out, the data packet corresponding to that ACK arrives one RTT later. As a control system, the latency of the feedback loop is therefore one RTT for each TCP connection. Meanwhile, to estimate the throughput of a TCP connection for a receive window adjustment, the shortest timescale is one RTT for that connection. Therefore, the control interval for a TCP connection in ICTCP is 2*RTT: one RTT of latency for the adjusted window to take effect, and one additional RTT to measure the achieved throughput with the newly adjusted receive window.
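A minimal Python sketch of the flow table and the 2*RTT pacing check described above; the per-flow fields beyond the 5-tuple key are our assumptions about what the algorithm needs, not a structure given in the text.

    from dataclasses import dataclass

    # A flow is keyed by its 5-tuple: (src_ip, dst_ip, src_port, dst_port, proto).
    FlowKey = tuple

    @dataclass
    class FlowState:
        rwnd: int = 65535            # currently advertised receive window (bytes)
        rtt: float = 0.0005          # live RTT estimate (seconds); assumed initial value
        measured_bps: float = 0.0    # smoothed measured throughput, a_i^m
        sample_bytes: int = 0        # bytes received since the last adjustment
        last_adjust: float = 0.0     # timestamp of the last window adjustment

    flow_table: dict[FlowKey, FlowState] = {}

    def on_packet(key: FlowKey, payload_len: int) -> None:
        """Parse-and-update path: every intercepted packet refreshes its flow entry."""
        state = flow_table.setdefault(key, FlowState())
        state.sample_bytes += payload_len

    def may_adjust(state: FlowState, now: float) -> bool:
        # The control interval is 2*RTT: one RTT for a new window to take effect,
        # plus one more RTT to measure the throughput achieved under that window.
        return now - state.last_adjust >= 2 * state.rtt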
iii) Receive Window Adjustment

For any ICTCP connection, the receive window is adjusted based on its incoming measured throughput and its expected throughput. The measured throughput represents the achieved throughput on a TCP connection, and it also implies the current requirement of the application over that TCP connection. It is smoothed over throughput samples as

    a_i^m = max(a_i^s, γ * a_i^m + (1 − γ) * a_i^s)

where a_i^m represents the incoming measured throughput of connection i, a_i^s represents a sample of the current throughput on connection i, and γ ∈ [0, 1] is an exponential smoothing factor. The expected throughput represents our expectation of the throughput on that TCP connection if the throughput were constrained only by the receive window:

    a_i^e = max(a_i^m, rwnd_i / RTT_i)

where a_i^e represents the expected throughput of connection i and rwnd_i represents the receive window of connection i. The ratio of the throughput difference of connection i is represented as

    d_i^b = (a_i^e − a_i^m) / a_i^e

Our idea on receive window adjustment is to increase the window when the difference ratio of measured and expected throughput is small, and to decrease the window when the difference ratio is large.
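Putting the three formulas together, here is a minimal Python sketch of one per-connection adjustment step; the thresholds (0.1 and 0.5), the smoothing factor, and the one-MSS step size are our assumptions, as the text above does not give the values.

    MSS = 1460  # assumed maximum segment size, bytes

    def adjust_rwnd(rwnd, rtt, measured_bps, sample_bytes, quota_ok):
        """One receive-window decision for a single connection (sketch; values assumed)."""
        sample_bps = sample_bytes * 8 / (2 * rtt)          # a_i^s over the 2*RTT interval
        gamma = 0.9                                        # assumed smoothing factor
        measured_bps = max(sample_bps,
                           gamma * measured_bps + (1 - gamma) * sample_bps)   # a_i^m
        expected_bps = max(measured_bps, rwnd * 8 / rtt)   # a_i^e = max(a_i^m, rwnd/RTT)
        diff_ratio = (expected_bps - measured_bps) / expected_bps             # d_i^b

        low, high = 0.1, 0.5                               # assumed thresholds
        if diff_ratio <= low and quota_ok:
            rwnd += MSS                          # close to expectation: grant more window
        elif diff_ratio >= high:
            rwnd = max(2 * MSS, rwnd - MSS)      # expectation far above need: shrink window
        return rwnd, measured_bps

The quota_ok flag stands in for the BWA check from the bandwidth-estimation step: increases are granted only when enough quota remains in the second subslot.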
iv) Choosing Fairness among Multiple Connections

When the receiver detects that the available bandwidth has become smaller than the threshold, ICTCP starts to decrease the receive windows of selected connections to prevent congestion. Considering that multiple active TCP connections typically work on the same job at the same time in a data center, there is a method that can achieve fair sharing for all connections without sacrificing throughput.
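The fairness method itself is not spelled out above, so the sketch below is only one plausible policy, under our own assumptions, consistent with the stated goal of equal sharing: when the available bandwidth falls below the threshold, decrease only the windows that sit above the current average, so flows converge toward a fair share.

    MSS = 1460  # assumed maximum segment size, bytes

    def rebalance(rwnds, bw_available, bw_threshold):
        """Hypothetical fairness pass over the receive windows of all active flows:
        shrink only windows above the average so flows converge to an equal share."""
        if bw_available >= bw_threshold or not rwnds:
            return rwnds
        mean = sum(rwnds) / len(rwnds)
        return [max(2 * MSS, w - MSS) if w > mean else w for w in rwnds]

    # Example: flows with unequal windows step toward fair sharing.
    print(rebalance([16 * 1460, 4 * 1460, 8 * 1460], bw_available=0.0, bw_threshold=1.0))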
IV. CONCLUSION

This paper presented an effective, practical, and safe solution to eliminate TCP incast collapse in data-center environments. Our approach utilizes the link bandwidth as fully as possible, but without packet losses, by limiting the round-trip time value. Based on the concept of the bandwidth-delay product, our technique conservatively estimates the reasonable number of concurrent senders. In this work, we used Network Driver Interface Specification filter functionalities such as collecting network statistics, monitoring, and filtering unauthorized traffic. We can avoid retransmissions, safely send the data to the receiver, and also avoid congestion and excess traffic. Our system is capable of matching user preferences while achieving full utilization of the receiver's access link in many different scenarios.

AUTHORS

S Anil Kumar received his B.Tech degree in Computer Science & Engineering from Chadalavada Ramanamma Engineering College, Tirupathi, affiliated to JNTU, Anantapur, in 2009, and is pursuing his M.Tech degree in Computer Science & Engineering at Audisankara College of Engineering & Technology, Gudur, affiliated to JNTU, Anantapur (2012-2014).
G Rajesh, M.Tech, (Ph.D), is currently working as an Assistant Professor at Audisankara College of Engineering and Technology, Gudur (M), Nellore, Andhra Pradesh, India. He has seven years of experience in teaching and two years of experience in the software industry. Previously he trained and worked with DSRC (Data Software Research Company), Chennai, as an Oracle Applications Functional Consultant, and he has worked with Capgemini India Ltd, Mumbai, as a Software Engineer (Oracle Apps Technical Consultant) on contract through Datamatics Pvt Ltd. He is pursuing his Ph.D. on “A Cross Layer Framework for Bandwidth Management of Wireless Mesh Networks”.