25.10.2016 Views

Network Fault Detection and Isolation

59-Active-Fault-Finding-in-Networks-RIPE-Madrid


Network Fault Detection and Isolation<br />

Richard Sheehan<br />

Production Engineer


1<br />

Network monitoring<br />

2<br />

Networks @Scale<br />

3<br />

NetNorad<br />

4<br />

NFI


No Network Monitoring<br />

People tell you when<br />

the network is broken!


People tell you when they think the network is broken!<br />

YOU investigate EVERYTHING, every time


[Figure: three time-series panels over several days: Output Utilization (bps), Input Utilization (bps), Errors]


Detecting packet loss<br />

Standard counters<br />

Non-standard counters<br />

SNMP<br />

ssw001# show platform trident counters<br />

Debug counters<br />

Description<br />

T2Fabric19/0/1 RX - Non congestion discards<br />

T2Fabric19/0/1 TX - IPV4 L3 unicast aged and dropped pkts<br />

T2Fabric19/0/1 RX - Receive policy discard<br />

T2Fabric19/0/1 TX - L2 multicast drop<br />

T2Fabric19/0/1 RX - Tunnel error packets<br />

T2Fabric19/0/1 TX - Invalid VLAN<br />

T2Fabric19/0/1 RX - Receive VLAN drop<br />

T2Fabric19/0/1 RX - Receive multicast drop<br />

T2Fabric19/0/1 TX - Dropped because TTL counter<br />

T2Fabric19/0/1 RX - Receive uRPF drop<br />

T2Fabric19/0/1 TX - Packet dropped due to any condition<br />

T2Fabric19/0/1 RX - IBP discard and CPB full<br />

T2Fabric19/0/1 TX - Miss in VXLT table counter


Fundamental<br />

Fatal Flaw


If you didn’t design and build the hardware and software<br />

You really shouldn’t trust it to tell you it’s ok.<br />

If you did design and build it<br />

You shouldn’t trust it to tell you it’s ok.


1<br />

Network monitoring<br />

2<br />

Networks @Scale<br />

3<br />

NetNorad<br />

4<br />

NFI


Data Center Networks<br />

CLOS Fabric<br />

[Figure: CLOS fabric diagram; the two groups of devices are labelled 1 to 511 and 512 to 1024]


Datacenter Network<br />

CLOS Fabric<br />

ECMP Hashing<br />

[Figure: two datacenters. DC1 devices: host.dc1, rsw26.dc1, csw.dc1, ssw.dc1, idc.dc1. DC2 devices: ssw.dc2, csw.dc2, rsw45.dc2, host.dc2.]


ECMP<br />

Equal-Cost Multi-Path




TCP/IP Packet<br />

IP Header: Version, IHL, Type of Service, Total Length, Identification, Flags, Fragment Offset, Time to Live, Protocol=6 (TCP), Header Checksum, Source Address, Destination Address, Options, Padding<br />

TCP: Source Port, Destination Port, Sequence Number, Acknowledgement Number, Data Offset, URG, ACK, PSH, RST, SYN, FIN, Window, Checksum, Urgent Pointer, TCP Options, Padding, TCP Data
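The header fields above are what ECMP forwarding typically keys on: a switch hashes a flow's addresses, protocol and ports and uses the result to pick one of several equal-cost next hops, so changing only the source port can move a flow onto a different path. The following is a minimal sketch of that idea, not how any particular switch does it. CRC32 is a stand-in for the vendor-specific ASIC hash, and the next-hop names are hypothetical.

```python
import zlib
from typing import NamedTuple, Sequence


class FiveTuple(NamedTuple):
    src_ip: str
    dst_ip: str
    protocol: int  # 6 = TCP
    src_port: int
    dst_port: int


def pick_next_hop(flow: FiveTuple, next_hops: Sequence[str]) -> str:
    """Choose one equal-cost next hop by hashing the flow's 5-tuple.

    CRC32 is only an illustrative stand-in for a hardware hash function.
    """
    key = "|".join(str(field) for field in flow).encode()
    return next_hops[zlib.crc32(key) % len(next_hops)]


if __name__ == "__main__":
    hops = ["csw1.dc1", "csw2.dc1", "csw3.dc1", "csw4.dc1"]  # hypothetical names
    for port in range(50000, 50004):
        flow = FiveTuple("10.0.0.10", "10.99.99.99", 6, port, 9999)
        print(port, pick_next_hop(flow, hops))
```

The same 5-tuple always maps to the same next hop, which is why a probe has to vary a header field, usually the source port, to exercise all parallel links.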




fbtracert<br />

github.com/facebook/fbtracert


[Figure: traceroute from host1.dc1 to host2.dc1 through rsw26.dc1, csw.dc1, ssw.dc1, csw.dc1 and rsw45.dc1, built up one TTL at a time. Source ports shown at each TTL in the final frame, one group per label:]<br />

TTL 1: 37701, 32702, 32703, 32704<br />

TTL 2: (37701, 32703); (32702, 32704)<br />

TTL 3: (32703); (32703); (32704); (32702)<br />

TTL 4: (37701, 32703); (32702, 32704)<br />

TTL 5: 37701, 32702, 32703, 32704<br />

TTL 6: 37701, 32702, 32703, 32704
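The frames above suggest how per-port paths can be recovered: send probes with several source ports at each TTL, record which device answered each one, and group the answers. Below is a rough sketch of that bookkeeping only. It does not send packets and is not fbtracert's code; the reply tuples and the `csw-a`/`csw-b` device names are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# One entry per answered probe: (ttl, src_port, responder). Hypothetical format.
Reply = Tuple[int, int, str]


def paths_by_port(replies: Iterable[Reply]) -> Dict[int, List[str]]:
    """Return the hop sequence seen by each source port, ordered by TTL."""
    hops: Dict[int, Dict[int, str]] = defaultdict(dict)
    for ttl, port, responder in replies:
        hops[port][ttl] = responder
    return {port: [by_ttl[t] for t in sorted(by_ttl)] for port, by_ttl in hops.items()}


def ports_by_hop(replies: Iterable[Reply]) -> Dict[Tuple[int, str], Set[int]]:
    """Group source ports by the (ttl, responder) pair that answered them."""
    groups: Dict[Tuple[int, str], Set[int]] = defaultdict(set)
    for ttl, port, responder in replies:
        groups[(ttl, responder)].add(port)
    return dict(groups)


if __name__ == "__main__":
    ports = [37701, 32702, 32703, 32704]
    replies = [(1, p, "rsw26.dc1") for p in ports]
    replies += [(2, 37701, "csw-a"), (2, 32703, "csw-a"),
                (2, 32702, "csw-b"), (2, 32704, "csw-b")]
    print(paths_by_port(replies))
    print(ports_by_hop(replies))
```

A port that stops appearing at some TTL while its siblings continue points at the hop where that particular path loses packets.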


1<br />

Network monitoring<br />

2<br />

Networks @Scale<br />

3<br />

NetNorad<br />

4<br />

NFI


Massive pinging<br />

Pingers<br />

Responders


NetNORAD evolution<br />

PING TCP<br />

ICMP UDP


UDP Probe format<br />

Signature<br />

SentTime<br />

ReceivedTime<br />

ResponseTime<br />

Traffic Class
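To make the probe fields concrete, here is a small sketch of how such a payload could be packed and how the timestamps could be used. The field sizes and byte order are assumptions for illustration and are not the actual UdpPinger wire format. The round-trip calculation assumes the responder stamps ReceivedTime and ResponseTime with its own clock, so that its processing time can be subtracted without synchronised clocks.

```python
import struct
import time

# Assumed layout (NOT the real UdpPinger format): 8-byte signature, three
# unsigned 64-bit microsecond timestamps, one unsigned byte for traffic class.
PROBE = struct.Struct("!8sQQQB")


def now_us() -> int:
    """Current wall-clock time in microseconds."""
    return time.time_ns() // 1000


def build_probe(signature: bytes, traffic_class: int, sent_us: int) -> bytes:
    """Pinger side: fill in Signature, SentTime and Traffic Class."""
    return PROBE.pack(signature, sent_us, 0, 0, traffic_class)


def answer_probe(payload: bytes, received_us: int, response_us: int) -> bytes:
    """Responder side: echo the probe with ReceivedTime and ResponseTime set."""
    signature, sent_us, _, _, traffic_class = PROBE.unpack(payload)
    return PROBE.pack(signature, sent_us, received_us, response_us, traffic_class)


def round_trip_us(payload: bytes, arrival_us: int) -> int:
    """Pinger side: network round trip, excluding time spent in the responder."""
    _, sent_us, received_us, response_us, _ = PROBE.unpack(payload)
    return (arrival_us - sent_us) - (response_us - received_us)


if __name__ == "__main__":
    probe = build_probe(b"NETNORAD", traffic_class=0, sent_us=now_us())
    reply = answer_probe(probe, received_us=5_000, response_us=5_200)
    print(len(reply), round_trip_us(reply, arrival_us=now_us()))
```

A probe that never comes back counts as loss; one that does yields a latency sample for the traffic class it was sent in.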


UdpPinger<br />

github.com/facebook/UdpPinger


How to ping and process data?


Prineville, OR<br />

Forest City, NC


Pinging in a cluster<br />

CSW: Network router<br />

RSW: Rack switch<br />

Pinger<br />

Target<br />

Target




1<br />

Network monitoring<br />

2<br />

Networks @Scale<br />

3<br />

NetNorad<br />

4<br />

NFI


Alarming should be as accurate as possible<br />

with a clear path towards auto-remediation


[Figure: CLOS fabric diagram; the two groups of devices are labelled 1 to 511 and 512 to 1024]<br />

Where is the bad link?


[Figure: the DC1 and DC2 topology (host.dc1, rsw26.dc1, csw.dc1, ssw.dc1, idc.dc1, ssw.dc2, csw.dc2, rsw45.dc2, host.dc2), shown over several frames and annotated "Lots of TCP Traceroutes"]


Src Port | Dst Port | Src IP | Dst IP | Links<br />

50000 | 9999 | 10.0.0.10 | 10.99.99.99 | e1.csw1.dc1, e5.ssw3.dc1, e19.idc6.dc1, e3.ssw4.dc2, e7.csw4.dc2<br />

50001 | 9999 | 10.0.0.10 | 10.99.99.99 | e1.csw4.dc1, e9.ssw2.dc1, e4.idc2.dc1, e8.ssw3.dc2, e7.csw3.dc2<br />

50002 | 9999 | 10.0.0.10 | 10.99.99.99 | e1.csw3.dc1, e5.ssw5.dc1, e19.idc6.dc1, e3.ssw4.dc2, e7.csw4.dc2
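A table like this is easier to use for fault isolation once it is inverted, so that each link lists the flows that cross it. The sketch below shows one way to hold the table and build that index. The `Flow` type and function name are illustrative and not taken from NFI; the rows are the three shown above.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Set


class Flow(NamedTuple):
    src_port: int
    dst_port: int
    src_ip: str
    dst_ip: str


# Path table: the links each flow was observed to traverse.
PATHS: Dict[Flow, List[str]] = {
    Flow(50000, 9999, "10.0.0.10", "10.99.99.99"): [
        "e1.csw1.dc1", "e5.ssw3.dc1", "e19.idc6.dc1", "e3.ssw4.dc2", "e7.csw4.dc2"],
    Flow(50001, 9999, "10.0.0.10", "10.99.99.99"): [
        "e1.csw4.dc1", "e9.ssw2.dc1", "e4.idc2.dc1", "e8.ssw3.dc2", "e7.csw3.dc2"],
    Flow(50002, 9999, "10.0.0.10", "10.99.99.99"): [
        "e1.csw3.dc1", "e5.ssw5.dc1", "e19.idc6.dc1", "e3.ssw4.dc2", "e7.csw4.dc2"],
}


def flows_by_link(paths: Dict[Flow, List[str]]) -> Dict[str, Set[Flow]]:
    """Invert the path table: for every link, the set of flows that use it."""
    index: Dict[str, Set[Flow]] = defaultdict(set)
    for flow, links in paths.items():
        for link in links:
            index[link].add(flow)
    return dict(index)


if __name__ == "__main__":
    for link, flows in sorted(flows_by_link(PATHS).items()):
        print(link, sorted(f.src_port for f in flows))
```

With this index, the flows that share a suspect link can be looked up directly, which is the input the error-count step needs.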


[Figure: the DC1 and DC2 topology (host.dc1, rsw26.dc1, csw.dc1, ssw.dc1, idc.dc1, ssw.dc2, csw.dc2, rsw45.dc2, host.dc2), annotated "Lots of TCP Traceroutes" and "Lots of Thrift requests"]


TCP/IP Packet<br />

IP Header: Version, IHL, Type of Service, Total Length, Identification, Flags, Fragment Offset, Time to Live, Protocol=6 (TCP), Header Checksum, Source Address, Destination Address, Options, Padding<br />

TCP: Source Port, Destination Port, Sequence Number, Acknowledgement Number, Data Offset, URG, ACK, PSH, RST, SYN, FIN, Window, Checksum, Urgent Pointer, TCP Options, Padding, TCP Data




[Figure: bar chart of error count (y-axis 0 to 2) per link: sw01-e2.dc1, ssw06-e2.dc1, ssw01-e2.dc1, idc05-e1.dc1, idc05-e6.dc1, ssw02-e5.dc2, sw02-e2.dc2, sw01-e2.dc2, rsw45-e2.dc2, rsw45-e4.dc2]


[Figure: the DC1 and DC2 topology (host.dc1, rsw26.dc1, csw.dc1, ssw.dc1, idc.dc1, ssw.dc2, csw.dc2, rsw45.dc2, host.dc2), annotated on each side with the link counts 48, 24 and 4 from the top of the fabric down to the rack]


Error Count for a Device = (% of requests we lost) * (Number of ECMP links at that layer in the Network) * (% of lost requests this circuit was in)
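The formula can be tried on a toy data set. The sketch below is one reading of it, not NFI's implementation. Percentages are expressed as fractions between 0 and 1, a "request" is a hypothetical record holding the links it crossed and whether it was lost, and the per-link ECMP width is supplied by the caller. A plausible intuition for the middle factor is that a single bad link among N equal-cost links only sees about 1/N of the traffic, so multiplying by N keeps wide layers comparable with narrow ones.

```python
from collections import Counter
from typing import Dict, List, NamedTuple, Sequence


class Request(NamedTuple):
    links: List[str]  # links the request traversed
    lost: bool        # True if no response came back


def error_counts(requests: Sequence[Request],
                 ecmp_width: Dict[str, int]) -> Dict[str, float]:
    """Score each link that appears on at least one lost request.

    score = loss fraction * ECMP links at the link's layer
            * fraction of lost requests that crossed the link
    """
    lost = [r for r in requests if r.lost]
    if not lost:
        return {}
    loss_fraction = len(lost) / len(requests)
    hits = Counter(link for r in lost for link in set(r.links))
    return {
        link: loss_fraction * ecmp_width[link] * (count / len(lost))
        for link, count in hits.items()
    }


if __name__ == "__main__":
    width = {"a1": 2, "a2": 2, "b1": 2, "b2": 2}  # hypothetical two-layer fabric
    requests = [
        Request(["a1", "b1"], lost=True),
        Request(["a1", "b2"], lost=False),
        Request(["a2", "b1"], lost=True),
        Request(["a2", "b2"], lost=False),
    ]
    scores = error_counts(requests, width)
    for link in sorted(scores, key=scores.get, reverse=True):
        print(link, scores[link])
```

In this toy run every lost request crosses `b1`, so it receives the highest score, while links that merely share a path with it score lower.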


[Figure: bar chart of error count for the same links, repeated first on a 0 to 2 scale and then on a 0 to 1000 scale]


Done is better than perfect?


Modular components<br />

NFI Client<br />

NFI Server<br />

NFIClient (Python)<br />

NFIRequestor (Python)<br />

Thrift Requestor (Python)<br />

NFIThriftServer (Python)<br />

NFITopology (Python)<br />

Ztrace (C)<br />

Stats aggregation + publishing<br />

What is latency/loss to get to DST hosts?<br />

What path do I take to get to DST hosts?


Software evolution<br />

NFI Client<br />

NFI Server<br />

NFIClient (Python)<br />

NFIRequestor (Python)<br />

NFITopology (Python)<br />

Thrift Requestor (Python, then C++)<br />

Ztrace (C)<br />

NFIThriftServer (Python, then C++)


[Figure: two time-series panels titled Server and Client; one x-axis runs from 11:00pm to 05:00am, the other from 01:00 to 03:00]




Reverse Traceroute<br />

NFI Client<br />

NFI Server<br />

NFIClient (Python)<br />

NFIRequestor (Python)<br />

NFITopology (Python)<br />

Thrift Requestor (C++)<br />

Ztrace (C)<br />

NFIThriftServer (C++)<br />

Ztrace (C)




What's next for NFI?


Running on Wedge


Backbone network<br />

[Figure: datacenter networks attached through DR and BB devices to an MPLS core, with an MPLS LSP between the BB devices]


Distributed Systems Hate Packet-Loss
