01.11.2014 Views

Internet Transport Protocols UDP and TCP


Outline<br />

<strong>Internet</strong> <strong>Transport</strong> <strong>Protocols</strong><br />

<strong>UDP</strong> <strong>and</strong> <strong>TCP</strong><br />

<strong>Transport</strong> Layer Review<br />

<strong>UDP</strong> Protocol<br />

<strong>UDP</strong> Characteristics<br />

<strong>UDP</strong> Functionalities<br />

<strong>TCP</strong> Protocol<br />

<strong>TCP</strong> Characteristics<br />

Connection Management<br />

<strong>TCP</strong> Flow <strong>and</strong> Congestion Control<br />

Design Issues<br />

TRANSPORT LAYER<br />

<strong>Transport</strong> Layer Services <strong>and</strong><br />

<strong>Protocols</strong><br />

<strong>Transport</strong> layer provides a logical connection<br />

between application processes running on<br />

different hosts<br />

<strong>Transport</strong> Layer Services<br />

<br />

<br />

<br />

<br />

Connection Management<br />

If connection-oriented<br />

Multiplexing <strong>and</strong> De-Multiplexing<br />

Data Segmentation <strong>and</strong> Reassembly<br />

Error, Flow <strong>and</strong> Congestion Control<br />



<strong>Transport</strong> Layer Concepts<br />

Figure: two hosts, each with an Application / <strong>Transport</strong> / IP / Network Access / Physical stack, are joined through a router (IP / Network Access / Physical); the two transport layers share a logical connection<br />

<strong>Internet</strong> <strong>Transport</strong> Layer <strong>Protocols</strong><br />

The IP suite offers two transport protocols<br />

User Datagram Protocol (<strong>UDP</strong>)<br />

Connectionless protocol<br />

“Best Effort” Service<br />

Unreliable<br />

Unordered datagram delivery<br />

No error or flow control<br />

Transmission Control Protocol (<strong>TCP</strong>)<br />

Connection-oriented protocol<br />

Reliable, ordered delivery of byte stream<br />

Error, flow <strong>and</strong> congestion control<br />

No delay guarantees <strong>and</strong> no bandwidth guarantees<br />

User Datagram Protocol<br />

<strong>UDP</strong><br />

<strong>UDP</strong> Characteristics<br />

<strong>UDP</strong> is a connectionless datagram service.<br />

No need to establish a connection prior to data transfer<br />

Datagrams may be generated <strong>and</strong> transmitted at any time.<br />

<strong>UDP</strong> datagrams are self-contained.<br />

<strong>UDP</strong> is unreliable:<br />

No acknowledgements for reliable delivery of data.<br />

The checksum is optional; when used, it covers the header <strong>and</strong><br />

the data.<br />

Contains no mechanism to detect missing or out of<br />

sequence datagrams.<br />

No mechanism for automatic retransmission.<br />

No mechanism for flow control<br />

<br />

Sender can over-run the receiver.<br />



<strong>UDP</strong> Service<br />

<strong>UDP</strong> provides unreliable connectionless<br />

delivery service using IP to transport<br />

datagrams<br />

<strong>UDP</strong> does not enhance the “best effort” service<br />

provided by IP<br />

<strong>UDP</strong> provides a mechanism to distinguish<br />

among multiple destinations within a host<br />

The mechanism, referred to as multiplexing, is<br />

based on the notion of “ports”<br />

IP Communication Ports<br />

To communicate, a sender needs to know<br />

<br />

<br />

IP address of the remote host <strong>and</strong><br />

Port number within that host<br />

Multiplexing in <strong>UDP</strong><br />

<strong>UDP</strong> Operation<br />

Figure: applications (App) attach to numbered ports on the <strong>UDP</strong> <strong>and</strong> <strong>TCP</strong> modules inside the OS; below them sit the IP module (with the ARP <strong>and</strong> RARP modules) <strong>and</strong> the network interface<br />

An incoming frame is identified by IP address, protocol, <strong>and</strong> port number<br />

<strong>UDP</strong> uses the port number<br />

to de-multiplex packets<br />



User Datagram Protocol<br />

<strong>UDP</strong> Checksum<br />

<strong>UDP</strong> header fields: <strong>UDP</strong> Source Port | <strong>UDP</strong> Destination Port | <strong>UDP</strong> Message Length | <strong>UDP</strong> Checksum, followed by Data<br />

Encapsulation in a physical data frame: Physical Header | IP Header | <strong>UDP</strong> Header | Data | Physical Trailer (IP header, <strong>UDP</strong> header <strong>and</strong> data form the IP datagram)<br />

<strong>UDP</strong> source port is optional (set to 0 if not<br />

used)<br />

<strong>UDP</strong> checksum is optional (set to 0 if not used)<br />

Using <strong>UDP</strong> checksum option is useful, however,<br />

since IP checksum does not cover the data portion<br />

of the datagram<br />

It provides the only way to ensure that data has<br />

arrived intact <strong>and</strong> should be used<br />

<br />

Also, the only way to verify the <strong>UDP</strong> header<br />

<strong>UDP</strong> Checksum<br />

Appropriate Uses of <strong>UDP</strong><br />

<br />

The <strong>UDP</strong> checksum covers more information than is<br />

present in the <strong>UDP</strong> datagram<br />

A pseudo-header is prepended to the <strong>UDP</strong> datagram, <strong>and</strong> a<br />

checksum over the entire object is computed<br />

The pseudo-header contains source <strong>and</strong> destination IP<br />

addresses, IP protocol type (code 17 for <strong>UDP</strong>), <strong>and</strong> <strong>UDP</strong><br />

datagram length (the pseudo-header not included)<br />

It guarantees that the datagram has reached the proper<br />

destination<br />

The checksum is computed as a one’s complement sum (sum<br />

modulo $2^{16}-1$) of all 16-bit words of the pseudo-header, the <strong>UDP</strong> header <strong>and</strong> the data<br />

(with the checksum field set to 0), followed by taking the one’s complement of the<br />

result<br />
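The checksum procedure described above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function names `ones_complement_sum16` and `udp_checksum` are hypothetical, IPv4 addresses are assumed, and the substitution of 0xFFFF for a computed zero follows the UDP convention that an all-zero field means "no checksum".

```python
import socket
import struct

def ones_complement_sum16(data: bytes) -> int:
    """One's complement sum of 16-bit big-endian words (odd input is zero-padded)."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return total

def udp_checksum(src_ip: str, dst_ip: str, udp_datagram: bytes) -> int:
    """Checksum over pseudo-header + UDP datagram.

    The checksum field inside udp_datagram must already be set to 0.
    """
    # Pseudo-header: source IP, destination IP, zero byte, protocol 17, UDP length.
    pseudo = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
              + struct.pack("!BBH", 0, 17, len(udp_datagram)))
    checksum = ~ones_complement_sum16(pseudo + udp_datagram) & 0xFFFF
    # A field value of 0 means "checksum not used", so a computed 0 is sent as 0xFFFF.
    return checksum or 0xFFFF
```

A receiver can verify a datagram by summing the pseudo-header and the received datagram, checksum field included; an intact datagram yields 0xFFFF.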

Inward data collection<br />

Outward data dissemination<br />

Request-response<br />

Real-time applications<br />

Streaming of real-time audio <strong>and</strong> video.<br />

On-time delivery is important<br />

<br />

Minimum overhead<br />



Transmission Control Protocol<br />

<strong>TCP</strong><br />

Transmission Control Protocol<br />

<strong>TCP</strong> provides a reliable virtual circuit service<br />

<strong>TCP</strong> guarantees ordered delivery of a stream of data<br />

without loss or duplication, despite unreliable network<br />

packet delivery service<br />

<strong>TCP</strong> assumes little about the underlying<br />

communication system<br />

<strong>TCP</strong> can be used with a variety of packet delivery<br />

services<br />

Dial-up telephone lines<br />

Local <strong>and</strong> wide area networks<br />

Low <strong>and</strong> high-speed long haul networks<br />

<strong>TCP</strong> Interface Characteristics<br />

Virtual Circuit connection<br />

<strong>TCP</strong> is full-duplex<br />

<strong>TCP</strong> is stream-oriented protocol<br />

Data is viewed as a stream of bytes<br />

<strong>TCP</strong> provides an unstructured stream<br />

<strong>TCP</strong> does not mark boundaries between records within the stream<br />

Application’s responsibility to understand stream contents<br />

Buffered transfer<br />

<strong>TCP</strong> collects enough data to fill a reasonably large datagram<br />

before transmission<br />

<strong>TCP</strong> provides a push mechanism to force transfer, if needed<br />

<strong>TCP</strong> supports a “stream of<br />

bytes” service<br />

Sender<br />

Receiver<br />



Stream Service is emulated using<br />

<strong>TCP</strong> “Segments”<br />

Sender<br />

Receiver<br />

<strong>TCP</strong> Data<br />

<strong>TCP</strong> Data<br />

Segment sent when:<br />

Segment is full (MSS bytes),<br />

Segment is not full, but a timer expires, or<br />

Data is “pushed” by the application.<br />

MSS = Maximum Segment Size<br />

<strong>TCP</strong> MSS<br />

<strong>TCP</strong> segments are the messages that carry data<br />

between <strong>TCP</strong> sender <strong>and</strong> receiver<br />

<strong>TCP</strong> must decide how many bytes to put into each<br />

message that it sends.<br />

The current size of a <strong>TCP</strong> segment, CSS, is<br />

determined by two factors<br />

The size of the receiver’s window (bytes) – W<br />

A ceiling on <strong>TCP</strong> segment size, never to be<br />

exceeded – Maximum Segment Size (MSS)<br />

CSS = minimum(MSS, W)<br />
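The rule CSS = minimum(MSS, W) can be illustrated with a short Python sketch. The helper names `current_segment_size` and `next_segment` are hypothetical, and the sketch deliberately ignores data already in flight, which a real sender would subtract from the usable window.

```python
def current_segment_size(mss: int, window: int) -> int:
    """CSS = min(MSS, W): never exceed the ceiling or the receiver's window."""
    return min(mss, window)

def next_segment(send_buffer: bytes, mss: int, window: int):
    """Cut the next segment off the front of the send buffer.

    Returns (segment, remaining_buffer). A zero window yields an empty segment.
    """
    css = current_segment_size(mss, window)
    return send_buffer[:css], send_buffer[css:]
```

With an MSS of 1460 bytes and a 4096-byte window the MSS is the limit; with a 500-byte window the receiver's window is the limit instead.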

<strong>TCP</strong> Interface Characteristics<br />

<strong>TCP</strong> specifications describe generally how<br />

applications use <strong>TCP</strong>, but do not dictate details<br />

of an interface<br />

There have been numerous implementations of <strong>TCP</strong><br />

<br />

<strong>TCP</strong> Reno, <strong>TCP</strong> Tahoe, <strong>TCP</strong> Vegas, SACK <strong>TCP</strong>, ….<br />

<strong>TCP</strong> Connection Establishment<br />

Initially, a service user at site 1 issues a passive<br />

open function<br />

Willingness to accept connections from other users<br />

<strong>TCP</strong> returns a passive open confirm with local<br />

connection name to use when a connection is set up<br />

Service user at site 1 can start “listening” for<br />

requests<br />



<strong>TCP</strong> Connection Establishment<br />

Three-way Handshake<br />

To establish a connection with service at site 1,<br />

a user at site 2 issues an active open<br />

Active open contains data needed to set up the<br />

connection<br />

This begins a three-way handshake<br />

Site 2 sends a data unit with the SYN bit set <strong>and</strong> seq# = x (SYN x)<br />

Site 1 receives it, then sends a data unit with the SYN bit set <strong>and</strong> seq# = y, acknowledging x+1 (ACK x+1 / SYN y)<br />

Site 2 receives it, then sends a data unit acknowledging y+1 (ACK y+1)<br />

Site 1 receives the data unit<br />
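The sequence and acknowledgement numbers exchanged in the three-way handshake can be traced with a small Python sketch. The `Segment` class and `three_way_handshake` function are hypothetical names used only for illustration; the arithmetic is taken modulo 2^32 because the sequence number fields are 32 bits wide.

```python
from dataclasses import dataclass

SEQ_SPACE = 2 ** 32  # sequence numbers are 32-bit values

@dataclass
class Segment:
    syn: bool = False
    ack: bool = False
    seq: int = 0
    ack_no: int = 0

def three_way_handshake(x: int, y: int) -> list:
    """Return the three segments exchanged for initial sequence numbers x and y."""
    # 1. Active opener: SYN with seq# = x
    syn = Segment(syn=True, seq=x)
    # 2. Passive opener: SYN with seq# = y, acknowledging x+1
    syn_ack = Segment(syn=True, ack=True, seq=y, ack_no=(syn.seq + 1) % SEQ_SPACE)
    # 3. Active opener: acknowledges y+1
    final_ack = Segment(ack=True, seq=syn_ack.ack_no,
                        ack_no=(syn_ack.seq + 1) % SEQ_SPACE)
    return [syn, syn_ack, final_ack]
```

Calling `three_way_handshake(100, 300)` gives a SYN with seq 100, a SYN/ACK with seq 300 acknowledging 101, and a final ACK acknowledging 301.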

Connection Termination<br />

Connection Termination<br />

<strong>TCP</strong> connections are terminated with “graceful<br />

close”<br />

Graceful close ensures that the connection is<br />

terminated after all data has been received<br />

One side issues Close request, <strong>and</strong> is not permitted<br />

to transmit data after sending it<br />

The other side acknowledges it, but can send data<br />

until it sends Close request too<br />

After the second Close request is acknowledged, the<br />

connection is closed<br />

Figure (graceful close message sequence): FIN x; ACK x+1 / Data y; FIN y+k; ACK y+k+1<br />

After sending FIN x, there is no data transmission from this side<br />

Last segment sent by receiver: sequence number y, length k<br />



<strong>TCP</strong> Segment Format<br />

<strong>TCP</strong> Segment Fields<br />

Figure: <strong>TCP</strong> segment layout, 32 bits per row (bit positions 0, 15, 31)<br />

Source Port | Destination Port<br />

Sequence Number<br />

Acknowledgement Number<br />

Data Offset | Reserved | URG, ACK, PSH, RST, SYN, FIN | Window<br />

Checksum | Urgent pointer<br />

Options | Padding<br />

Data<br />

<br />

<br />

<br />

<br />

Source <strong>and</strong> Destination Port Number<br />

16-bit – end-point identifiers<br />

Sequence Number<br />

<br />

<br />

32-bit – number of the first data octet in this segment except for<br />

when the SYN flag is set<br />

When the SYN flag is set, sequence number identifies “the Initial<br />

Sequence Number” <strong>and</strong> the first data octet is ISN+1<br />

Acknowledgement Number<br />

32-bit – number of the next octet expected to receive<br />

Data Offset<br />

4-bit – number of 32-bit words in the header (including options)<br />

<br />

Header Length<br />

<strong>TCP</strong> Flags<br />

URG: segment contains urgent data:<br />

urgent pointer points to the last octet of urgent data<br />

ACK: acknowledgement field is significant<br />

PSH: segment is sent with the push function<br />

RST: reset the connection<br />

<br />

Closes a half-open connection<br />

SYN: synchronize sequence numbers<br />

<br />

Used during open request<br />

FIN: no more data from sender<br />

<br />

Used during close request<br />

<strong>TCP</strong> Segment Fields<br />

<br />

<br />

Window – 16-bit number of unacknowledged octets that<br />

the sender is allowed to transmit<br />

The 16-bit field limits the window size to $2^{16}-1$ (65,535) octets, unless the<br />

“window scale” option is used<br />

Checksum – 16-bit one’s complement of a one’s<br />

complement sum of all 16-bit words of the segment <strong>and</strong> a<br />

pseudo-header (the checksum field is set to zero)<br />

The pseudo-header contains the sender’s <strong>and</strong> receiver’s IP<br />

addresses, the protocol type (code 6 for <strong>TCP</strong>), <strong>and</strong> the segment<br />

length<br />

Zero padding is added to the segment to make its length a<br />

multiple of 16 bits<br />

The pseudo-header <strong>and</strong> the padding are not transmitted<br />
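The field descriptions above can be turned into a small decoder. The Python sketch below unpacks the fixed 20-byte part of a segment header; `parse_tcp_header` and the dictionary keys are hypothetical names, options are returned undecoded, and the flag masks assume the six flags occupy the low six bits of the 16-bit word that starts with the data offset.

```python
import struct

# Flag masks within the 16-bit "data offset / reserved / flags" word.
FLAG_BITS = {"URG": 0x20, "ACK": 0x10, "PSH": 0x08,
             "RST": 0x04, "SYN": 0x02, "FIN": 0x01}

def parse_tcp_header(segment: bytes) -> dict:
    """Decode the fixed part of a TCP header; options stay as raw bytes."""
    if len(segment) < 20:
        raise ValueError("a TCP header is at least 20 bytes long")
    (src_port, dst_port, seq, ack_no, off_flags,
     window, checksum, urgent) = struct.unpack("!HHIIHHHH", segment[:20])
    data_offset = off_flags >> 12   # header length in 32-bit words
    header_len = data_offset * 4    # header length in bytes
    return {
        "src_port": src_port,
        "dst_port": dst_port,
        "seq": seq,
        "ack_no": ack_no,
        "data_offset": data_offset,
        "flags": {name for name, bit in FLAG_BITS.items() if off_flags & bit},
        "window": window,
        "checksum": checksum,
        "urgent_pointer": urgent,
        "options": segment[20:header_len],
        "payload": segment[header_len:],
    }
```

A header with data offset 5 carries no options, so the payload starts at byte 20.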



Pseudo-header<br />

Out-of-B<strong>and</strong> Data<br />

Source address (32 bits)<br />

Destination address (32 bits)<br />

00000000 | Protocol (00000110) | <strong>TCP</strong> segment length<br />

Out-of-band data is needed to handle abort or<br />

program interrupt signals<br />

These “signals” should not wait for the octets<br />

already in the <strong>TCP</strong> stream to be transmitted<br />

Telnet uses this feature to send “interrupt” commands<br />

<strong>TCP</strong> uses the URGENT feature to accommodate<br />

out-of-band data<br />

Out-of-Band Data<br />

Figure: out-of-band data path (User Process, Send Buffer, Network, Receive Buffer, User Process), with the out-of-band data shown separately from the buffered stream<br />

The application is required to check the out-of-band<br />

data stream before processing the regular data stream<br />

<strong>TCP</strong> notifies associated application of the<br />

beginning <strong>and</strong> end of urgent mode<br />

How <strong>TCP</strong> informs the application depends on the<br />

operating system<br />

Urgent data is detected by the URG flag set<br />

<strong>and</strong> retrieved by the urgent pointer<br />

Unfortunately, <strong>TCP</strong> does not denote the beginning<br />

of the urgent data in the segment<br />

<br />

It’s up to the application to decide where the urgent data<br />

starts<br />



<strong>TCP</strong> Options<br />

Originally, one option, maximum segment size<br />

was defined<br />

The 16-bit option may be used only in the initial<br />

connection request segment<br />

<br />

If this option is not used, any segment size is allowed<br />

Later, two other options have gained<br />

widespread acceptance<br />

Window scale factor<br />

Timestamps<br />

<strong>TCP</strong> Options<br />

End of options: Kind = 0 (8 bits)<br />

No operation (padding): Kind = 1 (8 bits)<br />

Maximum Segment Size: Kind = 2 (8 bits) | Length = 4 (8 bits) | Maximum Segment Size (16 bits)<br />

Window Scale Factor: Kind = 3 (8 bits) | Length = 3 (8 bits) | Shift Count (8 bits)<br />

Window Scale Option – Purpose<br />



Window Scale Option<br />

<br />

<br />

<br />

The option increases the <strong>TCP</strong> receive window above its<br />

maximum value of 65,535 bytes<br />

A much better alternative than changing the <strong>TCP</strong> header to<br />

accommodate the need for larger windows<br />

When the window scale option is used, the value of the window field is<br />

multiplied by $2^F$, where the scale factor F (between 0 <strong>and</strong> 14) is<br />

specified in the initial connection request segment<br />

A value F = 14 results in a maximum window size of over $10^9$ bytes<br />

($65535 \times 2^{14}$)<br />

Internally, <strong>TCP</strong> maintains a “real” window size of 32 bits<br />

The option can only appear in a SYN segment<br />

Thus, scale factor is agreed upon <strong>and</strong> fixed in each direction at the<br />

connection setup<br />
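The window scaling arithmetic can be checked with a few lines of Python. `effective_window` is a hypothetical helper; it assumes a shift count of 0 (no scaling) is permitted in addition to the values up to 14 mentioned above.

```python
def effective_window(window_field: int, shift_count: int) -> int:
    """Real window in bytes: the 16-bit header value multiplied by 2**shift_count."""
    if not 0 <= window_field <= 0xFFFF:
        raise ValueError("the window field is a 16-bit value")
    if not 0 <= shift_count <= 14:
        raise ValueError("the shift count is assumed to lie in 0..14")
    return window_field << shift_count  # left shift == multiply by 2**shift_count
```

With the largest field value and F = 14 the window is 65535 × 16384 = 1,073,725,440 bytes, which is why a 32-bit internal window variable is sufficient.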

<strong>TCP</strong> Options<br />

Timestamp – Used for 2 distinct mechanisms<br />

RTT Measurement (RTTM) – To track the RTT for data<br />

in order to identify changes in latency that may require ack<br />

timer adjustment<br />

Protect Against Wrapped Sequences (PAWS)<br />

Kind = 8 (8 bits) | Length = 10 (8 bits) | Timestamp Value (32 bits, drawn as two 16-bit halves) | Timestamp Echo Reply (32 bits, drawn as two 16-bit halves)<br />

This option can be used throughout the <strong>TCP</strong> connection<br />

<strong>TCP</strong> FLOW CONTROL<br />

<strong>TCP</strong> Data Flow Control<br />

<strong>TCP</strong> uses a variable size window protocol to<br />

regulate the flow control<br />

Its unique feature is that it decouples<br />

acknowledgements of received data units from<br />

the granting of a permission to send additional<br />

data<br />

<br />

Each acknowledgement contains a window<br />

advertisement<br />

Increased window advertisement allows sender to proceed with<br />

sending octets not acknowledged yet<br />

Decreased window advertisement causes the sender to stop at<br />

the boundary of the window<br />



<strong>TCP</strong> Sliding Window<br />

Figure: the byte stream is divided into data ACK’d, outstanding un-ACK’d data, data OK to send, <strong>and</strong> data not OK to send yet; the window size spans the outstanding un-ACK’d data <strong>and</strong> the data OK to send<br />

Window is meaningful to the sender<br />

Current window size is “advertised” by receiver<br />

Usually 4k – 8k bytes at connection set-up<br />

Acknowledgements <strong>and</strong><br />

Retransmission<br />

<br />

<br />

<br />

<strong>TCP</strong> uses a cumulative acknowledgement scheme<br />

ACK reports on the number of octets so far accumulated <strong>and</strong><br />

specifies the next octet expected<br />

Advantages<br />

ACKs are easy to generate unambiguously<br />

Lost ACKs do not necessarily force a retransmission<br />

Disadvantages<br />

Receiver does not report on the segments successfully received, if<br />

one intermediate segment is lost<br />

Timeout mechanism becomes very crucial<br />

<strong>TCP</strong> Window Mechanisms<br />

<strong>TCP</strong> uses a credit allocation scheme<br />

The sender includes the sequence number of the<br />

first octet in the segment data field<br />

The receiver acknowledges the incoming segment<br />

with a message of the form (ACK = i, WIN = j )<br />

<br />

<br />

All octets up to i-1 are acknowledged<br />

Permission is granted to send an additional window of j<br />

octets starting at i through i+j-1<br />

<strong>TCP</strong> Credit Allocation Scheme<br />

The credit allocation scheme is flexible<br />

Assume last message issued by receiver carries<br />

ACK = i <strong>and</strong> WIN = j<br />

To increase the credit to an amount k (k > j), when no additional<br />

data have arrived, the receiver issues ACK = i <strong>and</strong> WIN = k<br />

To acknowledge an incoming segment containing m octets of<br />

data (m < j) without granting additional credits, the receiver<br />

issues ACK = i+m <strong>and</strong> WIN = j-m<br />

The receiver can issue cumulative acks, <strong>and</strong> is not<br />

required to issue an ack immediately after receiving a<br />

segment<br />
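The credit allocation rules can be traced from the sender's side with a short Python sketch. `CreditSender` and its attribute names are hypothetical; the sketch only tracks sequence numbers and ignores retransmission.

```python
class CreditSender:
    """Tracks how many octets the sender may still transmit under (ACK, WIN) credits."""

    def __init__(self, first_octet: int, window: int):
        self.next_seq = first_octet            # next octet to be sent
        self.limit = first_octet + window - 1  # highest octet covered by the credit

    def can_send(self) -> int:
        """Number of octets that may still be sent without a new credit."""
        return max(0, self.limit - self.next_seq + 1)

    def send(self, length: int) -> None:
        if length > self.can_send():
            raise ValueError("segment exceeds the granted credit")
        self.next_seq += length

    def on_ack(self, ack: int, win: int) -> None:
        """(ACK = i, WIN = j): octets up to i-1 are acknowledged, i..i+j-1 may be sent."""
        self.limit = ack + win - 1
```

For example, a sender that starts at octet 1001 with a 1400-octet window may send up to octet 2400. After sending 600 octets and receiving (ACK = 1601, WIN = 800) the limit is still 2400; a later (ACK = 1601, WIN = 2400) raises it to 4000 without any new data having been acknowledged.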



<strong>TCP</strong> Window Control<br />

Effect of Window Size<br />

Sender<br />

Receiver<br />

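Figure: credit allocation example<br />

Sender <strong>and</strong> receiver both start with a window covering octets 1001 to 2400<br />

The sender transmits SN 1001, LEN 600, leaving octets 1601 to 2400 available<br />

The receiver replies ACK 1601, WIN 800; the window still ends at 2400<br />

The receiver then issues ACK 1601, WIN 2400; the window now covers 1601 to 4000<br />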

<strong>TCP</strong> Actual Throughput<br />

<br />

<strong>TCP</strong> performance can be affected by several complicating<br />

factors:<br />

In most cases, a number of <strong>TCP</strong> connections are multiplexed over<br />

the same network interface, so each connection is only allocated a<br />

fraction of bandwidth<br />

This reduces R, therefore S as well<br />

<strong>TCP</strong> connections usually involve a hop across multiple networks<br />

Router delay becomes the biggest contributor to D<br />

Actual data rate may be reduced below R if a bottleneck router exists<br />

along the path<br />

If any segment is lost <strong>and</strong> must be retransmitted, throughput is<br />

reduced<br />

<strong>TCP</strong> Sliding Window<br />

Fundamental Questions<br />

How much data can a <strong>TCP</strong> sender have<br />

outstanding in the network?<br />

How much data should <strong>TCP</strong> retransmit<br />

when an error occurs?<br />

Just selectively repeat the missing data?<br />

How does the <strong>TCP</strong> sender avoid overrunning<br />

the receiver’s buffers?<br />



<strong>TCP</strong> Flow Control<br />

<strong>TCP</strong> Flow Control<br />

Receiver throttles sender by advertising a<br />

window size no larger than the amount of data it<br />

can buffer.<br />

To avoid buffer overflow at the receiver side, the<br />

following invariant is always true:<br />

LastByteRcvd - LastByteRead


<strong>TCP</strong> Congestion Control<br />

<strong>TCP</strong> CONGESTION CONTROL<br />

<strong>TCP</strong> sliding window provides an adequate<br />

mechanism to control the data flow between the<br />

sender <strong>and</strong> the receiver<br />

<strong>TCP</strong> sender still needs to perform congestion control to<br />

ensure that the network can handle the window agreed upon<br />

by the sender <strong>and</strong> the receiver<br />

<strong>TCP</strong> relies on ACKs to control flow <strong>and</strong> congestion<br />

<br />

<br />

If the ACK is not received on time, <strong>TCP</strong> retransmits the<br />

segment<br />

An absence of an ACK signals congestion in the network<br />

<strong>TCP</strong> Window Control<br />

<strong>TCP</strong> Timeout Timers<br />

<strong>TCP</strong> does not rely on an explicit negative<br />

acknowledgement to detect errors<br />

<strong>TCP</strong> relies exclusively on positive ACKs <strong>and</strong><br />

retransmissions when an ACK does not arrive within<br />

a given period of time<br />

Retransmission TimeOut (RTO) should be on<br />

the order of RTT (Round-Trip Time)<br />

RTT <strong>and</strong> its statistics, however, vary considerably as<br />

the <strong>Internet</strong> load changes<br />

<br />

Thus timers play a major role in <strong>TCP</strong> congestion<br />

Figure: sender/receiver timeline; the round-trip time (RTT) runs from sending Data to receiving its ACK, <strong>and</strong> the Retransmission TimeOut (RTO) equals the estimated RTT plus a guard band<br />



Timer Values<br />

The timeout timer should be set to a value<br />

somewhat longer than the RTT<br />

Two strategies can be used to set the timer<br />

value:<br />

Static timer value<br />

Adaptive timer value<br />

Static Value Timers<br />

A fixed value timer based on the <strong>Internet</strong><br />

typical behavior is used for RTO value<br />

The strategy suffers from the inability to respond to<br />

<strong>Internet</strong> changing conditions<br />

<br />

<br />

A high value leads to sluggish service <strong>and</strong> an unnecessarily<br />

long response time to a lost packet<br />

A low value leads to a positive feedback condition that<br />

causes more retransmissions (some unnecessary) in a case<br />

of congestion<br />

Adaptive Timers<br />

<br />

<br />

<br />

<strong>TCP</strong> keeps track of the time taken to acknowledge data<br />

segments <strong>and</strong> sets its retransmission timers based on the<br />

average of the observed delays<br />

Can this be trusted?<br />

Cumulative ACK by the peer may delay an acknowledgement<br />

If a segment has been retransmitted, the sender cannot determine<br />

which transmission (the original or the retransmitted one) the ACK<br />

corresponds to<br />

<strong>Internet</strong> load may change suddenly<br />

Still, it tracks RTT variation better than the fixed strategy<br />

Adaptive Timers<br />

Smoothed Average<br />

<br />

<br />

An estimated value of RTT, SRTT(), is computed as a<br />

smoothed average<br />

$SRTT(k+1) = \alpha \cdot SRTT(k) + (1-\alpha) \cdot RTT(k+1)$<br />

The smaller the value of $\alpha$, the greater the weight of more<br />

recent observations<br />

With small values of $\alpha$, the average quickly reflects a rapid change in<br />

the observed quantity<br />

A disadvantage: a brief surge in the observed value followed by a<br />

return to the average causes strong fluctuations<br />

$\alpha = 0.9$ is recommended<br />
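The smoothed average is an exponentially weighted moving average in which the old estimate is weighted by a smoothing factor and the new sample by one minus that factor. A minimal Python sketch, with the hypothetical function name `update_srtt` and the factor passed as `alpha`:

```python
def update_srtt(srtt: float, rtt_sample: float, alpha: float = 0.9) -> float:
    """New estimate = alpha * old estimate + (1 - alpha) * newest RTT sample."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between 0 and 1")
    return alpha * srtt + (1.0 - alpha) * rtt_sample
```

With an estimate of 100 ms and a new sample of 200 ms, a factor of 0.9 moves the estimate to about 110 ms, while a factor of 0.5 moves it to 150 ms, which shows how a smaller factor reacts faster but also follows brief surges more closely.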



Adaptive Timers<br />

Adaptive Timers<br />

<br />

<br />

<br />

Given an SRTT value, what should be the RTO value?<br />

Use a constant value $\Delta$:<br />

$RTO(k+1) = SRTT(k+1) + \Delta$<br />

$\Delta$ is not proportional to SRTT<br />

If SRTT(k+1) is large, $\Delta$ may become insignificant,<br />

relatively<br />

As a consequence, fluctuations in the RTT cause unnecessary<br />

retransmissions<br />

If SRTT(k+1) value is small, $\Delta$ may become relatively<br />

large<br />

As a consequence, the mechanism depends solely on the $\Delta$ value, not<br />

reacting to changing conditions<br />

<br />

<strong>TCP</strong> Congestion Control<br />

<strong>TCP</strong> Congestion Control<br />

The rate at which a <strong>TCP</strong> entity can send data is<br />

determined by the rate of incoming ACKs to<br />

previous segments (with granted credit)<br />

This rate is determined by the “bottleneck” in the<br />

round trip path between the source <strong>and</strong> the<br />

destination<br />

<br />

<br />

<br />

<br />

Congestion control in the <strong>Internet</strong> is complex<br />

IP is a connectionless, stateless protocol<br />

No provision for determining congestion, let alone control it<br />

<strong>TCP</strong> provides only end-to-end flow control <strong>and</strong> relies on<br />

implicit feedback to deduce the presence of congestion<br />

<br />

ICMP quench messages are too crude to provide any effective<br />

control<br />

No cooperative, distributed algorithm exists between<br />

different <strong>TCP</strong> entities<br />

<br />

On the contrary, <strong>TCP</strong> entities may be selfish <strong>and</strong> do not cooperate to<br />

maintain some level of control<br />



<strong>TCP</strong> Congestion Control<br />

The only tool is the sliding-window flow<br />

control <strong>and</strong> error control mechanisms<br />

Not enough, in a dynamic environment<br />

A number of clever techniques have been<br />

developed <strong>and</strong> added to control congestion in<br />

the <strong>Internet</strong><br />

Mechanisms for congestion detection<br />

Mechanisms for congestion avoidance<br />

Mechanisms for congestion recovery<br />

<strong>TCP</strong> Flow <strong>and</strong> Congestion Control<br />

<strong>TCP</strong> is self-clocking: returning of ACKs<br />

functions as pacing signal<br />

After an initial burst, the sender’s segment rate<br />

matches the arrival rate of ACKs<br />

<br />

Sender rate equals the rate of the slowest link on the path<br />

This way <strong>TCP</strong> senses the bottleneck <strong>and</strong> regulates its rate<br />

accordingly<br />

<strong>TCP</strong> Flow <strong>and</strong> Congestion Control<br />

Network Bottleneck<br />

Pb is the minimum segment spacing on the slowest link<br />

Pr is the segment spacing at the receiver<br />

As is the ACK spacing at the sender<br />

Ab is the ACK spacing at the bottleneck<br />

Figure: data segments leave the bottleneck with spacing Pb <strong>and</strong> arrive at the receiver with spacing Pr<br />

Pb = Pr = Ab = As<br />

<strong>TCP</strong> Flow <strong>and</strong> Congestion Control<br />

<br />

<br />

A problem arises because the sender does not know<br />

where the bottleneck is: the network or the receiver<br />

If ACKs arrive relatively slowly due to network congestion, then the<br />

sender must transmit segments at a rate slower than the ACKs<br />

On the other hand, if the slow pace is due to the receiver, this pace<br />

should dictate the transmission policy<br />

Thus, <strong>TCP</strong> sliding window must be enhanced to factor in<br />

the network congestion<br />




Probability<br />

Average Queueing Delay<br />

Improving <strong>TCP</strong> Congestion<br />

Effectiveness<br />

Three techniques have been suggested for<br />

retransmission timer management<br />

RTT Variance Estimation<br />

Exponential Backoff<br />

Karn’s Algorithm<br />

Four techniques have been suggested for<br />

window management<br />

Slow Start<br />

Dynamic Window Sizing on Congestion<br />

Fast Retransmission<br />

Fast Recovery<br />

RTT Variance Estimation<br />

Factors of Influence<br />

RTT exhibits relatively high variance due to three<br />

sources<br />

A low data rate on the <strong>TCP</strong> connection relative to the<br />

propagation time makes the transmission delay relatively<br />

important in the RTT estimation<br />

Thus, high variation in datagram sizes induces high variation in<br />

RTT<br />

SRTT estimator can be heavily influenced by characteristics of<br />

the data, that have nothing to do with the network<br />

Abrupt changes in the <strong>Internet</strong> load <strong>and</strong> condition<br />

<strong>TCP</strong> receiver may use cumulative ACKs<br />

RTO Scheme Revisited<br />

<strong>TCP</strong> Timeout Estimates<br />

The original <strong>TCP</strong> scheme computes<br />

RTO as follows<br />

$RTO(k+1) = \beta \cdot SRTT(k+1)$, with $\beta = 2$<br />

There will be some (unknown) distribution<br />

of RTTs.<br />

We are trying to estimate an RTO to<br />

minimize the probability of a false timeout.<br />

Router queues grow when there is more<br />

traffic, until they become unstable.<br />

As load grows, variance of delay grows<br />

rapidly.<br />

If RTT has low variance, it results in<br />

unnecessarily high values of RTO<br />

In unstable environments, it may be<br />

inadequate to protect against unnecessary retransmissions<br />

Figure: mean <strong>and</strong> variance of RTT versus load (amount of traffic arriving at the router); the variance grows rapidly with load<br />



RTT Variance Estimates<br />

A more effective approach is to estimate<br />

the RTT standard deviation<br />

However, this is too costly, as it involves square <strong>and</strong><br />

square root calculations<br />

A less expensive alternative: the mean<br />

deviation<br />

$MDEV(X) = E\big[\,|X - E[X]|\,\big]$<br />

Dynamic Estimate of RTT<br />

Variability<br />

The algorithm to dynamically estimate the<br />

variability in estimating RTT uses an<br />

exponential smoothing technique<br />

$SRTT_{k+1} = \alpha \cdot SRTT_k + (1 - \alpha) \cdot RTT_{k+1}$<br />

$SERR_{k+1} = RTT_{k+1} - SRTT_k$<br />

$SDEV_{k+1} = \alpha' \cdot SDEV_k + (1 - \alpha') \cdot |SERR_{k+1}|$<br />

$RTO_{k+1} = SRTT_{k+1} + 4 \cdot SDEV_{k+1}$<br />

$\alpha = 7/8; \quad \alpha' = 3/4$<br />
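A minimal Python sketch of this estimator follows, assuming gains of 7/8 for SRTT and 3/4 for SDEV, and a deviation multiplier of 4. The class name, the seeding of SDEV with half the first sample (the approach RFC 6298 takes), and the choice to measure the error against the previous SRTT are assumptions made for the example.<br />

```python
class RttEstimator:
    """Exponentially smoothed RTT and deviation, used to derive the RTO."""

    def __init__(self, first_rtt, alpha=7 / 8, alpha_dev=3 / 4, multiplier=4):
        self.alpha = alpha            # weight kept on the old SRTT
        self.alpha_dev = alpha_dev    # weight kept on the old SDEV
        self.multiplier = multiplier  # how many deviations to add to SRTT
        # Assumption: seed SRTT with the first sample and SDEV with half of it.
        self.srtt = first_rtt
        self.sdev = first_rtt / 2

    def rto(self):
        return self.srtt + self.multiplier * self.sdev

    def update(self, rtt):
        """Fold one new RTT sample into the estimates and return the new RTO."""
        serr = rtt - self.srtt  # error against the previous smoothed estimate
        self.srtt = self.alpha * self.srtt + (1 - self.alpha) * rtt
        self.sdev = self.alpha_dev * self.sdev + (1 - self.alpha_dev) * abs(serr)
        return self.rto()


# Hypothetical samples in milliseconds.
estimator = RttEstimator(first_rtt=100.0)
new_rto = estimator.update(180.0)
print(estimator.srtt, estimator.sdev, new_rto)
```

A single late sample moves SRTT only slightly, while the deviation term widens the RTO noticeably, which is the intended behavior when delay variance rises.<br />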

<strong>TCP</strong> Congestion Control<br />

RTT variance estimation is efficient in detecting<br />

data loss<br />

It may not be sufficient in a case of retransmission<br />

Two other factors must be taken into consideration<br />

when managing retransmission timers in a case of<br />

congestion:<br />

What RTO value should be used on a retransmitted<br />

segment (i.e., when timeout occurs)?<br />

Exponential RTO Backoff<br />

What RTT samples should be used as input to compute<br />

SRTT in case of retransmission?<br />

Karn’s Algorithm<br />

Exponential RTO Backoff<br />

Timeouts are usually due to congestion<br />

Maintaining the same RTO value for retransmission is ill<br />

advised<br />

<br />

May cause sustained congestion due to the development of a<br />

similar pattern of behavior of all active <strong>TCP</strong> connections<br />

All sources wait for local RTO time <strong>and</strong> retransmit again<br />

A more efficient scheme must try to increase RTO values<br />

to give the <strong>Internet</strong> more time to clear the congestion<br />

Exponential RTO Backoff scheme<br />

RTO = q × RTO<br />

Typically q = 2<br />
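The backoff rule can be sketched as follows. The function name and the upper bound `max_rto` are assumptions for illustration; the slides do not mention a cap, although practical implementations usually bound the RTO.<br />

```python
def backoff_schedule(initial_rto, retries, q=2, max_rto=60.0):
    """Return the RTO used for the first send and for each retransmission."""
    schedule = []
    rto = initial_rto
    for _ in range(retries + 1):
        schedule.append(rto)
        # Multiply by q after every timeout; max_rto is an assumed cap.
        rto = min(q * rto, max_rto)
    return schedule


timeouts = backoff_schedule(initial_rto=1.0, retries=5)
print(timeouts)
```

With q = 2 the waiting time doubles after every timeout, so competing senders back away from a congested path quickly instead of retransmitting in lockstep.<br />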



RTT Samples Ambiguity<br />

Ambiguity may occur when a timeout occurs<br />

<strong>and</strong> the same segment is retransmitted<br />

Upon receiving an ACK, the sender has no way of<br />

determining which segment, the first or the second,<br />

has been acknowledged<br />

Feeding the resulting RTT into the RTO<br />

algorithm may lead to false measurements <strong>and</strong><br />

oscillations<br />

Retransmission <strong>and</strong> Timeouts<br />

<strong>TCP</strong> Source cannot distinguish between these<br />

two cases<br />

[Figure: two timing diagrams between Host A and Host B; in each, a retransmission leads to a wrong RTT sample]<br />

How can we estimate RTT when packets are<br />

retransmitted?<br />

RTT Samples – Karn’s Algorithm<br />

Karn’s algorithm uses the following rules<br />

Do NOT use the measured RTT for a retransmitted<br />

segment to update SRTT <strong>and</strong> SDEV<br />

Calculate the backoff RTO when a retransmission occurs<br />

RTO = q × RTO (typically q = 2)<br />

Use the backoff RTO value for succeeding segments until<br />

an acknowledgement arrives for a segment that has not<br />

been retransmitted<br />

When an acknowledgement is received for a segment that was not<br />

retransmitted, use the normal algorithm to update SRTT <strong>and</strong> SDEV<br />

<strong>and</strong> compute future RTO values<br />
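The rules above can be sketched as a small timer object. This is a simplified illustration: the class and method names are hypothetical, the gains 7/8 and 3/4 and the multiplier 4 are assumed, and a real sender would track retransmission status per segment.<br />

```python
class KarnTimer:
    """RTO management that ignores RTT samples from retransmitted segments."""

    def __init__(self, srtt, sdev, q=2):
        self.srtt = srtt
        self.sdev = sdev
        self.q = q
        self.rto = srtt + 4 * sdev

    def on_timeout(self):
        # A retransmission occurs: back off the RTO, leave SRTT and SDEV alone.
        self.rto = self.q * self.rto
        return self.rto

    def on_ack(self, rtt_sample, was_retransmitted):
        if was_retransmitted:
            # Ambiguous sample: discard it and keep the backed-off RTO.
            return self.rto
        # Clean sample: resume the normal smoothing and recompute the RTO.
        serr = rtt_sample - self.srtt
        self.srtt = 7 / 8 * self.srtt + 1 / 8 * rtt_sample
        self.sdev = 3 / 4 * self.sdev + 1 / 4 * abs(serr)
        self.rto = self.srtt + 4 * self.sdev
        return self.rto


# Hypothetical values in milliseconds.
timer = KarnTimer(srtt=100.0, sdev=10.0)
after_timeout = timer.on_timeout()
after_ambiguous_ack = timer.on_ack(500.0, was_retransmitted=True)
after_clean_ack = timer.on_ack(100.0, was_retransmitted=False)
print(after_timeout, after_ambiguous_ack, after_clean_ack)
```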

<strong>TCP</strong> Congestion Control<br />

Armed with an efficient RTO estimation mechanism,<br />

use lack of acknowledgements as congestion<br />

indication <strong>and</strong> maintain a congestion window to<br />

estimate congestion in the network<br />

CongestionWindow :: a variable maintained by the <strong>TCP</strong><br />

source for each connection.<br />

<strong>TCP</strong> is modified such that the maximum number of<br />

bytes of unacknowledged data allowed is the<br />

minimum of CongestionWindow <strong>and</strong><br />

AdvertisedWindow<br />

AllowedWindow = min(CongestionWindow, AdvertisedWindow)<br />
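As a small illustration of the window rule, the following sketch checks whether a sender may put another segment on the wire. The helper names and the byte counts are assumptions chosen for the example.<br />

```python
def allowed_window(congestion_window, advertised_window):
    """The sender is limited by whichever window is smaller."""
    return min(congestion_window, advertised_window)


def can_send(bytes_in_flight, segment_size, congestion_window, advertised_window):
    """True if one more segment fits within the allowed window."""
    limit = allowed_window(congestion_window, advertised_window)
    return bytes_in_flight + segment_size <= limit


# Hypothetical values in bytes: the network, not the receiver, is the bottleneck.
print(allowed_window(8000, 16000))
print(can_send(7000, 1000, 8000, 16000))
print(can_send(7500, 1000, 8000, 16000))
```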



Window Management Techniques<br />

A number of strategies have been suggested to<br />

effectively manage the sending window<br />

Slow Start<br />

Dynamic Window Sizing on Congestion<br />

Fast Retransmission<br />

Fast Recovery<br />

These techniques are incorporated in all<br />

modern implementations of <strong>TCP</strong><br />

Slow Start<br />

<strong>TCP</strong> is self-clocking <strong>and</strong> paces itself<br />

approximately to the rate of the bottleneck<br />

At the initialization phase, no such pacing is<br />

available for guidance<br />

Sending at full window speed is ill-advised as it<br />

may cause large packet drops<br />

Some means of gradually expanding the initial<br />

window until pacing takes over is needed<br />

“Slow Start” is the procedure recommended for<br />

initial window expansion until pacing takes over<br />

Phases of Congestion Control<br />

<br />

<strong>TCP</strong> Slow Start<br />

The goal of <strong>TCP</strong> Slow Start is to discover<br />

roughly the proper sending rate quickly<br />

Whenever starting traffic on a new connection,<br />

or whenever increasing traffic after congestion<br />

was experienced:<br />

Initialize cwnd = 1<br />

Each time a segment is acknowledged, increment<br />

cwnd by one (cwnd++).<br />

Continue until<br />

the threshold, ssthresh, is reached or a packet is lost<br />

The variable ssthresh can be thought of as an estimate of the level below which<br />

congestion is not expected.<br />
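The per-ACK rule can be sketched as below, with cwnd counted in segments. The sketch assumes one ACK per segment (no delayed ACKs) and no loss; the function name is hypothetical.<br />

```python
def slow_start_rounds(ssthresh, initial_cwnd=1):
    """Return cwnd at the start of each round trip until ssthresh is reached."""
    cwnd = initial_cwnd
    history = [cwnd]
    while cwnd < ssthresh:
        acks_this_round = cwnd  # assumption: every segment sent is ACKed once
        for _ in range(acks_this_round):
            if cwnd >= ssthresh:
                break  # threshold reached: slow start ends here
            cwnd += 1  # cwnd++ for each acknowledged segment
        history.append(cwnd)
    return history


print(slow_start_rounds(ssthresh=8))
print(slow_start_rounds(ssthresh=6))
```

Because every ACK adds one segment, a full window of ACKs doubles cwnd, so the window grows exponentially per round trip even though each step is small.<br />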



<strong>TCP</strong> Slow Start<br />

Each ACK adds 2 “credits”<br />

This gives the sender the permission to send two segments<br />

Slow start increases rate exponentially fast<br />

Rate is doubled every RTT!<br />

[Figure: cwnd doubles every round trip: cwnd = 1, 2, 4, 8]<br />

Dynamic Window Sizing on Congestion<br />

Slow Start enables the sender to quickly<br />

determine a reasonable window size for the<br />

connection<br />

What happens if packet loss (timeout) occurs?<br />

<br />

<br />

<br />

This could be a sign of congestion, but it is not clear how<br />

serious the congestion is<br />

Would resetting cwnd = 1 <strong>and</strong> restarting slow-start be<br />

enough?<br />

This may not be conservative enough<br />

Need a more aggressive response<br />

Congestion Avoidance: Additive<br />

Increase<br />

<br />

<br />

<br />

After exiting slow start, slowly increase cwnd to probe for<br />

additional available bandwidth<br />

Competing flows may end transmission<br />

May have been “unlucky” with an early drop<br />

If cwnd > ss_thresh then<br />

each time a segment is acknowledged<br />

increment cwnd by 1/cwnd (cwnd += 1/cwnd).<br />

cwnd is increased by one only after all segments in the window have been<br />

acknowledged<br />

Increases by 1 per RTT, vs. doubling per RTT<br />
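The sketch below applies the per-ACK rule for one full window of ACKs, to show that it amounts to roughly one segment per round trip. The function name is hypothetical, and cwnd is kept as a fractional number of segments.<br />

```python
def congestion_avoidance_round(cwnd):
    """Apply cwnd += 1/cwnd once for each ACK of a full window."""
    acks_this_round = int(cwnd)  # assumption: one ACK per segment in the window
    for _ in range(acks_this_round):
        cwnd += 1.0 / cwnd
    return cwnd


after_one_rtt = congestion_avoidance_round(8.0)
print(after_one_rtt)
```

Starting from cwnd = 8, the result is slightly below 9, because each increment is computed from the already enlarged window.<br />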

Congestion Avoidance: Additive<br />

Increase<br />

When a timeout occurs, use these rules:<br />

Set a slow-start threshold, ssthresh, equal to half of<br />

cwnd<br />

ssthresh = cwnd / 2<br />

Set cwnd = 1 <strong>and</strong> perform slow start until cwnd =<br />

ssthresh<br />

In this phase, cwnd is increased by 1 for every ACK<br />

received<br />

For cwnd ≥ ssthresh, increase cwnd by 1 for each<br />

RTT<br />



Example of Slow Start + Congestion Avoidance<br />

Assume that ss_thresh = 8<br />

[Figure: cwnd (in segments) versus time in round-trip times, t = 0 to 6; cwnd = 1, 2, 4, 8 up to ssthresh, then 9, 10]<br />
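The example can be reproduced with a round-by-round simulation. This is a simplified sketch: it works in whole segments per round trip, the function name and the `timeout_rounds` parameter are hypothetical, and the floor of 2 segments on ssthresh is an assumption.<br />

```python
def simulate_cwnd(rounds, ssthresh, timeout_rounds=()):
    """Return cwnd (in segments) observed at the start of each round trip."""
    cwnd = 1
    trace = []
    for t in range(rounds):
        trace.append(cwnd)
        if t in timeout_rounds:
            # Timeout: remember half the window, then restart slow start.
            ssthresh = max(cwnd // 2, 2)  # assumed floor of 2 segments
            cwnd = 1
        elif cwnd < ssthresh:
            cwnd = min(2 * cwnd, ssthresh)  # slow start: double per RTT
        else:
            cwnd += 1  # congestion avoidance: one segment per RTT
    return trace


print(simulate_cwnd(rounds=6, ssthresh=8))
print(simulate_cwnd(rounds=10, ssthresh=8, timeout_rounds={5}))
```

The first call yields the sequence 1, 2, 4, 8, 9, 10 from the example. The second adds a timeout in round 5, after which the window restarts at 1 and slow start stops at the new threshold of 5.<br />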

<strong>TCP</strong> Improvement<br />

FAST RETRANSMIT<br />

FAST RECOVERY<br />

Acknowledgments in <strong>TCP</strong><br />

• Receiver sends ACK to sender<br />

ACK is used for flow control, error control, <strong>and</strong> congestion control<br />

• ACK number sent is the next sequence number expected<br />

• Delayed ACK: <strong>TCP</strong> receiver normally delays transmission of an ACK (for about 200ms)<br />

Why?<br />

• ACKs are not delayed when packets are received out of sequence<br />

Why?<br />

[Figure: exchange of 1K segments (SeqNo=0, 1024, 2048, 3072) with ACKs (AckNo=1024, 2048, 2048); one segment is lost]<br />

Fast Retransmission Policy – <strong>TCP</strong> Tahoe<br />

When a segment is lost, original <strong>TCP</strong> waits for an<br />

ACK that’s not coming <strong>and</strong> eventually times out.<br />

It is often the case that some of the segments sent after the lost<br />

segment arrive at the receiver.<br />

For each such out-of-order segment received, the receiver sends a<br />

duplicate ACK, notifying the sender that the receiver<br />

is waiting for the missing segment.<br />

<strong>TCP</strong> Tahoe interprets duplicate ACKs as an<br />

indication that a segment was lost.<br />
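The receiver behavior that produces duplicate ACKs can be sketched as follows. The sketch assumes fixed 1K segments numbered from 0 and an immediate ACK for every arrival; the function name is hypothetical.<br />

```python
def cumulative_acks(arrivals, segment_size=1024):
    """Return the ACK number the receiver sends after each arriving segment."""
    received = set()
    next_expected = 0
    acks = []
    for seq in arrivals:
        received.add(seq)
        # Advance past every segment that is now contiguous with earlier data.
        while next_expected in received:
            next_expected += segment_size
        acks.append(next_expected)  # always ACK the next byte expected
    return acks


# The segment with SeqNo=2048 is lost; later segments still arrive.
acks = cumulative_acks([0, 1024, 3072, 4096, 5120])
print(acks)
```

The three segments that arrive after the gap each repeat AckNo=2048, which is the signal the sender can act on without waiting for the timer.<br />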



Fast Retransmit<br />

• If three or more duplicate ACKs are received in a row, the <strong>TCP</strong> sender considers the segment as lost.<br />

• Then <strong>TCP</strong> performs a retransmission of the missing segment, without waiting for a timeout to happen.<br />

• Enter slow start:<br />

ssthresh = cwnd/2<br />

cwnd = 1<br />

[Figure: sender transmits 1K segments SeqNo=0 to 3072; duplicate ACKs (AckNo=1024) trigger retransmission of SeqNo=1024, followed by SeqNo=4096; labels cwnd=12, ssthresh=5]<br />

Fast Recovery<br />

• Fast recovery avoids slow start after a fast retransmit<br />

• Intuition: Duplicate ACKs indicate that data is getting through<br />

• After three duplicate ACKs set:<br />

Retransmit “lost packet”<br />

ssthresh = cwnd/2<br />

cwnd = ssthresh+3<br />

Enter congestion avoidance<br />

Increment cwnd by one for each additional duplicate ACK<br />

• When ACK arrives that acknowledges “new data” (here: AckNo=2048), set:<br />

cwnd=ssthresh<br />

enter congestion avoidance<br />

[Figure: same exchange with fast recovery; labels cwnd=12, ssthresh=5 during the duplicate ACKs (AckNo=1024), then cwnd=9, ssthresh=9 when AckNo=2048 arrives]<br />
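A compact sketch of a sender that combines fast retransmit with fast recovery follows. It is a simplified illustration under stated assumptions: cwnd is counted in segments, the window is inflated to ssthresh + 3 on the third duplicate ACK (the rule RFC 5681 uses), normal window growth on new ACKs is left out, and the class and attribute names are hypothetical.<br />

```python
class RenoSender:
    """Tracks duplicate ACKs and applies fast retransmit and fast recovery."""

    DUP_THRESHOLD = 3

    def __init__(self, cwnd, ssthresh):
        self.cwnd = cwnd
        self.ssthresh = ssthresh
        self.last_ack = None
        self.dup_acks = 0
        self.in_fast_recovery = False
        self.retransmitted = []  # sequence numbers resent by fast retransmit

    def on_ack(self, ack_no):
        if ack_no == self.last_ack:
            self.dup_acks += 1
            if self.in_fast_recovery:
                # Each further duplicate means another segment left the network.
                self.cwnd += 1
            elif self.dup_acks == self.DUP_THRESHOLD:
                # Fast retransmit: resend the missing segment right away.
                self.retransmitted.append(ack_no)
                self.ssthresh = max(self.cwnd // 2, 2)  # assumed floor of 2
                self.cwnd = self.ssthresh + 3
                self.in_fast_recovery = True
        else:
            if self.in_fast_recovery:
                # New data acknowledged: deflate the window, leave recovery.
                self.cwnd = self.ssthresh
                self.in_fast_recovery = False
            self.last_ack = ack_no
            self.dup_acks = 0


sender = RenoSender(cwnd=12, ssthresh=64)
for ack in [1024, 1024, 1024, 1024]:  # one new ACK, then three duplicates
    sender.on_ack(ack)
cwnd_after_third_dup = sender.cwnd
sender.on_ack(1024)                   # a fourth duplicate inflates the window
cwnd_after_fourth_dup = sender.cwnd
sender.on_ack(5120)                   # new data acknowledged
print(cwnd_after_third_dup, cwnd_after_fourth_dup, sender.cwnd)
```

Starting from cwnd = 12, the third duplicate ACK sets ssthresh to 6 and cwnd to 9, and the first ACK for new data brings cwnd back to 6 without a return to slow start.<br />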

Fast Recovery<br />

[Figure: cwnd versus time, with the (initial) ssthresh level marked; phases: slow start, congestion avoidance; events: fast-retransmit, new ACK, timeout]<br />

Summary of <strong>TCP</strong> Behavior<br />

<strong>TCP</strong> Variation | Response to 3 dupACKs | Response to Partial ACK of Fast Retransmission | Response to “full” ACK of Fast Retransmission<br />

Tahoe | Do fast retransmit, enter slow start | ++cwnd | ++cwnd<br />

Reno | Do fast retransmit, enter fast recovery | Exit fast recovery, deflate window, enter congestion avoidance | Exit fast recovery, deflate window, enter congestion avoidance<br />

NewReno | Do fast retransmit, enter modified fast recovery | Fast retransmit <strong>and</strong> deflate window – remain in modified fast recovery | Exit modified fast recovery, deflate window, enter congestion avoidance<br />

• When entering slow start, if connection is new,<br />

ssthresh = arbitrarily large value<br />

cwnd = 1.<br />

else,<br />

ssthresh = max(flight size/2, 2*MSS)<br />

cwnd = 1.<br />

• In slow start ++cwnd on new ACK<br />

• When entering either fast recovery or modified fast recovery,<br />

ssthresh = max(flight size/2, 2*MSS)<br />

cwnd = ssthresh.<br />

• In congestion avoidance<br />

cwnd += 1*MSS per RTT<br />
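The window settings in the summary can be sketched as one function that returns the state each variant enters on three duplicate ACKs. The function name, the dictionary layout, and the choice to express cwnd in bytes (so that "cwnd = 1" becomes one MSS) are assumptions for illustration.<br />

```python
def on_triple_dupack(variant, flight_size, mss):
    """Return the new ssthresh, cwnd (both in bytes) and state for a variant."""
    ssthresh = max(flight_size // 2, 2 * mss)
    if variant == "tahoe":
        # Tahoe falls back to slow start with a window of one segment.
        return {"ssthresh": ssthresh, "cwnd": mss, "state": "slow start"}
    if variant == "reno":
        return {"ssthresh": ssthresh, "cwnd": ssthresh, "state": "fast recovery"}
    if variant == "newreno":
        return {"ssthresh": ssthresh, "cwnd": ssthresh,
                "state": "modified fast recovery"}
    raise ValueError(f"unknown variant: {variant}")


# Hypothetical connection: 12000 bytes in flight, MSS of 1000 bytes.
for name in ("tahoe", "reno", "newreno"):
    print(name, on_triple_dupack(name, flight_size=12000, mss=1000))
```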



Summary<br />

Transmission Control Protocol Discussion<br />

Segment format <strong>and</strong> functionality<br />

<strong>TCP</strong> Flow Control<br />

<strong>TCP</strong> Congestion Control<br />

<br />

AIMD strategy<br />

