INTERNATIONAL CONFERENCE ON CURRENT INNOVATIONS IN ENGINEERING AND TECHNOLOGY
ISBN: 378 - 26 - 138420 - 5
AN EFFECTIVE APPROACH TO ELIMINATE TCP INCAST COLLAPSE IN DATACENTER ENVIRONMENTS

S Anil Kumar (1), G Rajesh (2)

(1) M.Tech 2nd year, Dept. of CSE, Audisankara College of Engineering & Technology, Gudur, A.P, India.
(2) Asst Professor, Dept. of CSE, Audisankara College of Engineering & Technology, Gudur, A.P, India.
Abstract – Transport Control Protocol (TCP) many-to-one communication congestion happens in high-bandwidth and low-latency networks when two or more synchronized servers send data to the same receiver in parallel. For many key data-center applications such as MapReduce and Search, this many-to-one traffic pattern is common. Hence, TCP many-to-one communication congestion may severely degrade their performance, e.g., by increasing response time. In this paper, we explore many-to-one communication by focusing on the relationships between TCP throughput, round-trip time (RTT), and receive window. Unlike previous approaches, which moderate the impact of TCP incast congestion by using a fine-grained timeout value, our plan is to design an Incast congestion Control for TCP (ICTCP) scheme on the receiver side. In particular, this method changes the TCP receive window proactively before packet loss occurs.

Index terms – congestion control, many-to-one communication, round trip time, TCP throughput.

I. INTRODUCTION

Transport Control Protocol (TCP) is widely used on the Internet and normally works fine. However, recent works have shown that TCP does not work well for many-to-one traffic patterns on high-bandwidth, low-latency networks. Congestion occurs when many synchronized servers under the same Gigabit Ethernet switch simultaneously send data to one receiver in parallel. Only after all connections have finished the data transmission can the next round be issued. Thus, these connections are also called barrier-synchronized. The final performance is determined by the slowest TCP connection, which may suffer from timeout due to packet loss. The performance collapse of these many-to-one TCP connections is called TCP incast congestion. Data-center networks are well structured and layered to achieve high bandwidth and low latency, and the buffer size of top-of-rack (ToR) Ethernet switches is usually small, as shown in Fig. 1.
Fig. 1 Data-center network of a ToR switch connected to multiple rack-mounted servers.

A recent measurement study showed that a barrier-synchronized many-to-one traffic pattern is common in data-center networks, mainly caused by MapReduce and similar applications in data centers; this incast traffic pattern is shown in Fig. 2.

Fig. 2 Incast congestion in a data-center application.

The root cause of TCP incast collapse is that the highly bursty traffic of multiple TCP connections overflows the Ethernet switch buffer in a short period of time, causing intense packet loss and thus TCP retransmissions and timeouts. Prior solutions focused on either reducing the response time for packet loss recovery with quicker retransmissions, or controlling switch buffer occupation to avoid overflow by using ECN and modified TCP on both the sender and receiver sides. This paper focuses on avoiding packet loss before incast congestion occurs, which is more appealing than recovery after loss; moreover, recovery schemes can be complementary to congestion avoidance. Our idea is to perform incast congestion avoidance at the receiver side. The receiver side can adjust the receive window size of each TCP connection, so that the aggregate burstiness of all the synchronized senders is kept under control. We call our design Incast congestion Control for TCP (ICTCP). We first perform congestion avoidance at the system level. We then use the per-flow state to finely tune the receive window of each connection on the receiver side. The technical novelties of this work are as follows:

1) To perform congestion control on the receiver side, we use the available bandwidth on the network interface as a quota to coordinate the receive window increase of all incoming connections.

2) Our per-flow congestion control is performed independently, in slotted time of the round-trip time (RTT) of each connection, which is also the control latency in its feedback loop.
3) Our receive window adjustment is based on the ratio of the difference between the measured and expected throughput over the expected throughput. This allows us to estimate the throughput requirement from the sender side and adapt the receive window accordingly. We also find that live RTT is necessary for throughput estimation, as we have observed that TCP RTT in a high-bandwidth, low-latency network increases with throughput, even if link capacity is not reached. We have developed and implemented ICTCP as a Windows Network Driver Interface Specification (NDIS) filter driver.

i) TCP Incast Congestion

Incast congestion happens when multiple sending servers under the same ToR switch send data to one receiver server simultaneously, as shown in Fig. 3.

Fig. 3 Scenario of incast congestion in data-center networks.

The amount of data transmitted by each connection is relatively small. In Fig. 4, we show the goodput achieved on multiple connections versus the number of sending servers.

Fig. 4 Total goodput of multiple barrier-synchronized TCP connections versus the number of senders, where the data traffic volume per sender is a fixed amount.

We first establish multiple TCP connections between all senders and the receiver, respectively. Then, the receiver sends out a (very small) request packet to ask each sender to transmit data, respectively. The TCP connections are issued round by round, and one round ends when all connections in that round have finished their data transfer to the receiver. We observe similar goodput trends for three different traffic amounts per server, but with slightly different transition points. TCP throughput is severely degraded by incast congestion, since one or more TCP connections can experience timeouts caused by packet drops. TCP variants sometimes improve performance, but cannot prevent incast congestion collapse, since most of the timeouts are caused by full-window losses due to Ethernet switch buffer overflow. The TCP incast scenario is common for data-center applications.
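To make the round structure concrete, the following Python sketch issues one barrier-synchronized round under our own assumptions: the host addresses, port, and block size are hypothetical (the text above does not fix them), and connections are re-established per round rather than kept open as in the measurement setup.

    import socket
    from concurrent.futures import ThreadPoolExecutor

    # Hypothetical parameters; the paper does not give concrete values.
    SENDERS = [("10.0.0.%d" % i, 5001) for i in range(2, 42)]  # ~40 servers under one ToR
    BLOCK_SIZE = 64 * 1024  # fixed traffic volume per sender and per round (assumed)

    def fetch_block(addr):
        """Ask one sender for its data block and read it completely."""
        with socket.create_connection(addr) as s:
            s.sendall(b"GET")                 # the (very small) request packet
            remaining = BLOCK_SIZE
            while remaining > 0:
                chunk = s.recv(min(65536, remaining))
                if not chunk:
                    raise ConnectionError("sender closed early")
                remaining -= len(chunk)

    def run_round():
        # Barrier synchronization: the round ends only when the slowest
        # connection has finished, so a single timeout stalls the whole round.
        with ThreadPoolExecutor(max_workers=len(SENDERS)) as pool:
            for f in [pool.submit(fetch_block, a) for a in SENDERS]:
                f.result()  # wait for every sender before issuing the next round

    for round_no in range(10):
        run_round()

The barrier in run_round is exactly what makes incast painful: the goodput of the whole round is gated by the one connection that hits a retransmission timeout.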
ii) Reasons for TCP Incast Congestion

We focus on the typical incast scenario where dozens of servers are connected by a Gigabit Ethernet switch. In this scenario, the congestion point happens right before the receiver. A recent measurement study showed that this scenario exists in data-center networks, and the traffic between servers under the same ToR switch is actually one of the most significant traffic patterns in data centers, as locality has been considered in job distribution.

Incast congestion happens when the switch buffer overflows because the network pipe is not large enough to contain all the TCP packets injected into the network. The ToR switch is usually low-end (compared to higher-layer ones), and thus its queue size is not very large. To constrain the number of packets in flight, TCP has two windows: a congestion window on the sender side and a receive window on the receiver side. This paper chooses TCP receive window adjustment as its solution space. If the TCP receive window sizes are properly controlled, the total receive window size of all connections should be no greater than the base bandwidth-delay product (BDP) plus the queue size.
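For a rough sense of scale, with assumed numbers that the text does not specify: on a 1-Gbit/s last hop with a 100-µs base RTT, the base BDP is 10^9 bit/s × 100 µs = 100 kbit ≈ 12.5 kB; if the ToR port has, say, a 100-kB output queue, the receive windows of all connections sharing that port should sum to no more than about 112.5 kB to avoid overflow.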
We observe that the TCP receive window can be used to throttle TCP throughput, so it can be leveraged to handle incast congestion even though the receive window was originally designed for flow control. The benefit of an incast congestion control scheme at the receiver side is that the receiver knows how much throughput it has achieved and how much available bandwidth remains. The difficulty at the receiver side is that an overly throttled window may constrain TCP performance, while an oversized window may not prevent incast congestion. Moreover, the TCP connections arriving at a receiver could be incast or not, and the coexistence of incast and non-incast connections must be handled.

Previous work focused on how to mitigate the impact of timeouts, which are caused by a large amount of packet loss under incast congestion. Given such high bandwidth and low latency, we focus instead on how to perform congestion avoidance to prevent switch buffer overflow. Avoiding unnecessary buffer overflow significantly reduces TCP timeouts and saves unnecessary retransmissions.

II. SYSTEM DESIGN

Our goal is to improve TCP performance for incast congestion without introducing a new transport-layer protocol. Our transport-layer solution keeps backward compatibility on the protocol and programming interface, and makes our scheme general enough to handle incast congestion in future high-bandwidth and low-latency networks.

As the base RTT is hundreds of microseconds in data centers, our algorithm is restricted to adjusting the receive window only for TCP flows with an RTT of less than 2 ms; this constraint is designed to focus on low-latency flows. Based upon the following observations, our receive-window-based incast congestion control is intended to set a proper receive window for all TCP connections sharing the same last hop. Considering that there are many
TCP connections sharing the bottlenecked last hop before incast congestion, we adjust the TCP receive window to make those connections share the bandwidth equally. This is because, in a data center, parallel TCP connections may belong to the same job, where the last one finished determines the final performance.

III. ICTCP ALGORITHM

ICTCP provides a receive-window-based congestion control algorithm for TCP at the end-system. The receive windows of all low-RTT TCP connections are jointly adjusted to control throughput under incast congestion. The ICTCP algorithm closely follows the design points made above. We describe how to set the receive window of a TCP connection.

i) Bandwidth Estimation

We use the available bandwidth as the basis for increasing the receive windows of all incoming connections on the receiver server. We develop ICTCP as an NDIS driver on the Windows OS. Our NDIS driver intercepts TCP packets and modifies the receive window size if needed. It is assumed there is one network interface on the receiver server, and the symbols defined below correspond to that interface. The algorithm can also be applied to a scenario where the receiver has multiple interfaces; the connections on each interface should then perform this algorithm independently. Assume the link capacity of the interface on the receiver server is L. Define the bandwidth of the total incoming traffic observed on that interface as BWT, which includes all types of packets, i.e., broadcast, multicast, and unicast of UDP or TCP, etc. Then, define the available bandwidth BWA on that interface, used to increase the receive windows of all incoming connections, as

    BWA = max(0, β * L − BWT)

where β ∈ [0, 1] is a parameter to absorb potential oversubscribed bandwidth during window adjustment. A larger β indicates the need to more conservatively constrain the receive window and places higher requirements on the switch buffer to avoid overflow; a lower β indicates the need to more aggressively constrain the receive window, but throughput could then be unnecessarily throttled. We use a fixed setting of β in ICTCP. The available bandwidth BWA serves as the quota for all incoming connections to increase their receive windows for higher throughput. Each flow should estimate its potential throughput increase before its receive window is increased. Only when there is enough quota (BWA) can the receive window be increased, and the corresponding quota is consumed to prevent bandwidth oversubscription. To estimate the available bandwidth on the interface and provide a quota for later receive window increases, we divide time into slots. Each slot consists of two subslots of the same length. For each network interface, we measure all the traffic received in the first subslot and use it to calculate the available bandwidth as a quota for window increases in the second subslot. The receive window of a TCP connection is never increased in the first subslot, but it may be decreased when congestion is detected or the receive window is identified as being over-satisfied.
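The estimation loop above can be summarized in a short sketch. The following Python fragment is a minimal illustration under our own assumptions: the link capacity, β value, and subslot length are hypothetical, and a real implementation (e.g., the NDIS driver) would hook actual interface counters rather than a callback.

    # Minimal sketch of ICTCP's slotted available-bandwidth estimation.
    LINK_CAPACITY = 1e9   # L: link capacity of the interface in bit/s (assumed 1 Gbit/s)
    BETA = 0.9            # beta in [0, 1]; 0.9 is an assumed setting, not from the text
    SUBSLOT = 0.002       # subslot length in seconds (assumed)

    bw_available = 0.0    # BWA: quota left for receive-window increases (bit/s)
    in_second_subslot = False

    def end_of_subslot(bits_received, first_subslot):
        """Called at every subslot boundary with the total traffic seen on the interface."""
        global bw_available, in_second_subslot
        if first_subslot:
            bw_total = bits_received / SUBSLOT                         # BWT over this subslot
            bw_available = max(0.0, BETA * LINK_CAPACITY - bw_total)   # BWA = max(0, beta*L - BWT)
            in_second_subslot = True    # increases are allowed only in the next (second) subslot
        else:
            in_second_subslot = False   # back to measuring; windows may only decrease

    def try_window_increase(throughput_gain):
        """Grant a window increase only if enough quota remains; consume it if granted."""
        global bw_available
        if in_second_subslot and throughput_gain <= bw_available:
            bw_available -= throughput_gain   # consume quota to avoid oversubscription
            return True
        return False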
ii) Flow Stack Information

A flow table is the key data structure maintained in the receiver server. A flow is identified by a 5-tuple: source/destination IP address, source/destination port, and protocol. The flow table stores flow information for all the active flows. The packet header is parsed, and the corresponding information is updated in the flow table. Network Driver Interface Specification filter functionalities are performed here, such as collecting network statistics, monitoring activities, and filtering unauthorized traffic. In ICTCP, each connection adjusts its receive window only when an ACK is sent out on that connection. No additional pure TCP ACK packets are generated solely for receive window adjustment, so no traffic is wasted. For a TCP connection, after an ACK is sent out, the data packet corresponding to that ACK arrives one RTT later. As a control system, the latency of the feedback loop is therefore one RTT for each TCP connection. Meanwhile, to estimate the throughput of a TCP connection for a receive window adjustment, the shortest timescale is one RTT for that connection. Therefore, the control interval for a TCP connection in ICTCP is 2*RTT: one RTT of latency for the adjusted window to take effect, and one additional RTT to measure the achieved throughput with the newly adjusted receive window.
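A minimal Python sketch of the flow table and the 2*RTT pacing check described above; the per-flow fields beyond the 5-tuple key are our assumptions about what the algorithm needs, not a structure given in the text.

    from dataclasses import dataclass

    # A flow is keyed by its 5-tuple: (src_ip, dst_ip, src_port, dst_port, proto).
    FlowKey = tuple

    @dataclass
    class FlowState:
        rwnd: int = 65535            # currently advertised receive window (bytes)
        rtt: float = 0.0005          # live RTT estimate (seconds); assumed initial value
        measured_bps: float = 0.0    # smoothed measured throughput, a_i^m
        sample_bytes: int = 0        # bytes received since the last adjustment
        last_adjust: float = 0.0     # timestamp of the last window adjustment

    flow_table: dict[FlowKey, FlowState] = {}

    def on_packet(key: FlowKey, payload_len: int) -> None:
        """Parse-and-update path: every intercepted packet refreshes its flow entry."""
        state = flow_table.setdefault(key, FlowState())
        state.sample_bytes += payload_len

    def may_adjust(state: FlowState, now: float) -> bool:
        # The control interval is 2*RTT: one RTT for a new window to take effect,
        # plus one more RTT to measure the throughput achieved under that window.
        return now - state.last_adjust >= 2 * state.rtt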
iii) Receive Window Adjustment

For any ICTCP connection, the receive window is adjusted based on its incoming measured throughput and its expected throughput. The measured throughput represents the achieved throughput on a TCP connection, and it also implies the current requirement of the application over that TCP connection. It is smoothed over throughput samples as

    a_i^m = max(a_i^s, γ * a_i^m + (1 − γ) * a_i^s)

where a_i^m represents the incoming measured throughput of connection i, a_i^s represents a sample of the current throughput on connection i, and γ ∈ [0, 1] is an exponential smoothing factor. The expected throughput represents our expectation of the throughput on that TCP connection if the throughput were constrained only by the receive window:

    a_i^e = max(a_i^m, rwnd_i / RTT_i)

where a_i^e represents the expected throughput of connection i and rwnd_i represents the receive window of connection i. The ratio of the throughput difference of connection i is represented as

    d_i^b = (a_i^e − a_i^m) / a_i^e

Our idea on receive window adjustment is to increase the window when the difference ratio of measured and expected throughput is small, and to decrease the window when the difference ratio is large.
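Putting the three formulas together, here is a minimal Python sketch of one per-connection adjustment step; the thresholds (0.1 and 0.5), the smoothing factor, and the one-MSS step size are our assumptions, as the text above does not give the values.

    MSS = 1460  # assumed maximum segment size, bytes

    def adjust_rwnd(rwnd, rtt, measured_bps, sample_bytes, quota_ok):
        """One receive-window decision for a single connection (sketch; values assumed)."""
        sample_bps = sample_bytes * 8 / (2 * rtt)          # a_i^s over the 2*RTT interval
        gamma = 0.9                                        # assumed smoothing factor
        measured_bps = max(sample_bps,
                           gamma * measured_bps + (1 - gamma) * sample_bps)   # a_i^m
        expected_bps = max(measured_bps, rwnd * 8 / rtt)   # a_i^e = max(a_i^m, rwnd/RTT)
        diff_ratio = (expected_bps - measured_bps) / expected_bps             # d_i^b

        low, high = 0.1, 0.5                               # assumed thresholds
        if diff_ratio <= low and quota_ok:
            rwnd += MSS                          # close to expectation: grant more window
        elif diff_ratio >= high:
            rwnd = max(2 * MSS, rwnd - MSS)      # expectation far above need: shrink window
        return rwnd, measured_bps

The quota_ok flag stands in for the BWA check from the bandwidth-estimation step: increases are granted only when enough quota remains in the second subslot.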
iv) Choosing Fairness among Multiple Connections

When the receiver detects that the available bandwidth has become smaller than the threshold, ICTCP starts to decrease the receive windows of selected connections to prevent congestion. Considering that multiple active TCP connections typically work on the same job at the same time in a data center, there is a method that can achieve fair sharing for all connections without sacrificing throughput.
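The fairness method itself is not spelled out above, so the sketch below is only one plausible policy, under our own assumptions, consistent with the stated goal of equal sharing: when the available bandwidth falls below the threshold, decrease only the windows that sit above the current average, so flows converge toward a fair share.

    MSS = 1460  # assumed maximum segment size, bytes

    def rebalance(rwnds, bw_available, bw_threshold):
        """Hypothetical fairness pass over the receive windows of all active flows:
        shrink only windows above the average so flows converge to an equal share."""
        if bw_available >= bw_threshold or not rwnds:
            return rwnds
        mean = sum(rwnds) / len(rwnds)
        return [max(2 * MSS, w - MSS) if w > mean else w for w in rwnds]

    # Example: flows with unequal windows step toward fair sharing.
    print(rebalance([16 * 1460, 4 * 1460, 8 * 1460], bw_available=0.0, bw_threshold=1.0))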
IV. CONCLUSION

This paper presented an effective, practical, and safe solution to eliminate TCP incast collapse in data-center environments. Our approach utilizes the link bandwidth as fully as possible, but without packet losses, by limiting the round-trip time value. Based on the concept of the bandwidth-delay product, our technique conservatively estimates the reasonable number of concurrent senders. In this work, we used Network Driver Interface Specification filter functionalities such as collecting network statistics, monitoring, and filtering unauthorized traffic. We can avoid retransmissions, safely send the data to the receiver, and also avoid congestion and excess traffic. Our system is capable of matching user preferences while achieving full utilization of the receiver's access link in many different scenarios.

AUTHORS

S Anil Kumar received his B.Tech degree in Computer Science & Engineering from Chadalavada Ramanamma Engineering College, Tirupathi, affiliated to JNTU, Anantapur, in 2009, and is pursuing his M.Tech degree in Computer Science & Engineering at Audisankara College of Engineering & Technology, Gudur, affiliated to JNTU, Anantapur (2012-2014).
G Rajesh, M.Tech, (Ph.D), is currently working as an Assistant Professor at Audisankara College of Engineering and Technology, Gudur (M), Nellore, Andhra Pradesh, India. He has seven years of experience in teaching and two years of experience in the software industry. Previously he trained and worked with DSRC (Data Software Research Company), Chennai, as an Oracle Applications Functional Consultant, and he has worked with Capgemini India Ltd, Mumbai, as a Software Engineer (Oracle Apps Technical Consultant) on contract through Datamatics Pvt Ltd. He is pursuing his Ph.D. on “A Cross Layer Framework for Bandwidth Management of Wireless Mesh Networks”.