Troubleshooting a slow network on Linux is not an easy task. In this post we approach the problem through network metrics: setting up a monitor for each metric speeds up troubleshooting considerably. Hopefully this post gives you more ideas on how to fix a slow network on Linux.
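We will cover these topics today: packet errors and drops on the network interface, IP fragmentation, IP reassembly, and TCP retransmission.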
It is hard to collect all of the key network metrics directly from the OS. Luckily, we can use Telegraf to collect them, send them to InfluxDB, and display them in Grafana. We can get more info from here. Below are some key network metrics on Linux:
Packet errors or drops on the network interface
TCP connection status
Ping latency/packet loss
Application port latency
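Of the metrics above, application port latency is the easiest to sample by hand: time how long a TCP connection to the service port takes to open. The following is a minimal sketch, not a full monitor. The helper name `tcp_connect_latency_ms` is made up for illustration, and it assumes connect time is an acceptable proxy for port latency (it does not measure how long the application takes to answer a request).

```python
import socket
import time


def tcp_connect_latency_ms(host, port, timeout=3.0):
    """Return the time in milliseconds taken to open a TCP connection.

    Hypothetical helper for illustration. Raises OSError (which includes
    timeouts) if the connection cannot be established.
    """
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        elapsed = time.perf_counter() - start
    return elapsed * 1000.0


# Example call (host and port are placeholders, not from the original post):
#   tcp_connect_latency_ms("db01.example.com", 1521)
```

Sampling this at a fixed interval and plotting the result gives a latency trend that can be compared against the interface and TCP counters.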
Check Packet Errors or Drops on Network Interface
Packet errors and drops are usually related to the physical layer, such as a broken SFP or a broken Ethernet adapter, or to a network configuration mismatch with the switch side. We hit this problem in an Oracle RAC interconnect environment: the MTU configured on the host did not match the switch, and a lot of packets were dropped as a result.
From the chart, we can easily see the drops/errors on all the interfaces (this is a good example).
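Even without a dashboard, the per-interface error and drop counters can be read from `/proc/net/dev`. The sketch below parses that file's text and reports interfaces with non-zero counters. The function name `parse_net_dev` and the sample counter values are made up for illustration.

```python
# Sample text in the layout of /proc/net/dev (values are invented).
SAMPLE_NET_DEV = "\n".join([
    "Inter-|   Receive                                                |  Transmit",
    " face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed",
    "    lo: 1234 10 0 0 0 0 0 0 1234 10 0 0 0 0 0 0",
    "  eth0: 987654 1000 2 5 0 0 0 0 456789 900 1 3 0 0 0 0",
])


def parse_net_dev(text):
    """Return {interface: error/drop counters} from /proc/net/dev text."""
    stats = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 12:
            continue
        # Receive columns come first (errs=2, drop=3), then transmit
        # columns starting at index 8 (errs=10, drop=11).
        stats[name.strip()] = {
            "rx_errs": int(fields[2]),
            "rx_drop": int(fields[3]),
            "tx_errs": int(fields[10]),
            "tx_drop": int(fields[11]),
        }
    return stats


# On a live host, replace SAMPLE_NET_DEV with open("/proc/net/dev").read().
for name, counters in parse_net_dev(SAMPLE_NET_DEV).items():
    if any(counters.values()):
        print(name, counters)
```

These counters are cumulative since boot, so what matters when troubleshooting is whether they keep increasing between two readings.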
Check IP Fragmentation on Linux
As the name implies, IP fragmentation occurs when a datagram is larger than the MTU of a link it has to cross. The datagram is then split into fragments small enough to fit that MTU.
An IP packet is broken down into smaller pieces if the packet size exceeds the data link layer protocol limits. This is commonly known as fragmentation, and the process can take place at the originating device or at intermediate routers. To retrieve the original message, the packet must be reassembled at the destination device. Intermediate routers can fragment packets, but they cannot reassemble them, because fragments do not always take the same route from source to destination.
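To make the size limit concrete, the sketch below computes how an IPv4 datagram would be split for a given MTU. It assumes a 20-byte IPv4 header without options and uses the rule that every fragment except the last carries a payload that is a multiple of 8 bytes. The function name and the 8192-byte UDP datagram are hypothetical, chosen only for illustration.

```python
def fragment_payload_sizes(ip_payload_len, mtu, ip_header_len=20):
    """Return the IP payload size of each fragment for one IPv4 datagram.

    Assumes a fixed IP header length (20 bytes, no options).
    """
    max_payload = mtu - ip_header_len
    # Non-final fragments must carry a multiple of 8 bytes.
    max_frag = (max_payload // 8) * 8
    if max_frag <= 0:
        raise ValueError("MTU too small to carry any payload")
    sizes = []
    remaining = ip_payload_len
    while remaining > max_payload:
        sizes.append(max_frag)
        remaining -= max_frag
    sizes.append(remaining)  # last (or only) fragment
    return sizes


# An 8192-byte UDP datagram plus its 8-byte UDP header is 8200 bytes of IP payload.
print(fragment_payload_sizes(8192 + 8, mtu=1500))  # [1480, 1480, 1480, 1480, 1480, 800]
print(fragment_payload_sizes(8192 + 8, mtu=9000))  # [8200]
```

With a 1500-byte MTU the datagram becomes six fragments, while with a 9000-byte MTU it fits in a single packet, which is why an MTU mismatch along the path matters.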
Monitor IP Reassemblies on Linux
IP reassembly occurs at the final recipient of the message, after all fragments, each having taken whatever lowest-cost path was available to it, have arrived. Attempting reassembly at an intermediate step has a few challenges.
By default, the TCP implementation sends packets marked for no fragmentation (the Don't Fragment bit is set) and listens for ICMP responses from routers indicating that packets could not be forwarded due to MTU problems. It can then decrease the packet size used for the connection. This mechanism is known as Path MTU Discovery.
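In the Oracle RAC environment, the heartbeat uses UDP. UDP does not split data into smaller segments the way TCP does, so a datagram larger than the MTU is fragmented and has to be reassembled at the receiver. We need to pay attention to this metric.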
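On Linux, the IP fragmentation and reassembly counters are exposed in the `Ip:` rows of `/proc/net/snmp`, as a header line of counter names followed by a line of values. The sketch below pairs the two lines into a dictionary. The helper name `read_snmp_row` is made up for illustration, and the sample text is abbreviated to a few columns with invented values.

```python
# Abbreviated sample in the layout of /proc/net/snmp (values are invented;
# the real file has more columns and more protocols).
SAMPLE_SNMP = "\n".join([
    "Ip: Forwarding DefaultTTL InReceives ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates",
    "Ip: 2 64 500000 3 1200 380 40 100 0 600",
])


def read_snmp_row(text, proto):
    """Return {counter_name: value} for one protocol (for example "Ip")."""
    rows = [line.split() for line in text.splitlines()
            if line.startswith(proto + ":")]
    if len(rows) < 2:
        raise ValueError("no header/value pair found for " + proto)
    names, values = rows[0][1:], rows[1][1:]
    return dict(zip(names, map(int, values)))


# On a live host, replace SAMPLE_SNMP with open("/proc/net/snmp").read().
ip = read_snmp_row(SAMPLE_SNMP, "Ip")
for key in ("ReasmReqds", "ReasmOKs", "ReasmFails", "FragCreates", "FragFails"):
    print(key, ip[key])
```

A `ReasmFails` value that keeps growing between readings means fragments are arriving but datagrams cannot be put back together, which is the situation to watch for on a UDP-based interconnect.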
Monitor TCP Retransmission on Linux
This is a very important metric on the TCP layer. If there are a lot of TCP retransmissions, something is wrong between the two endpoints, and we need to engage the network team to troubleshoot further.
We can also monitor more metrics, such as outgoing resets (OutRsts), passive opens (PassiveOpens), and active opens (ActiveOpens).
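The TCP counters are cumulative since boot, so a retransmission rate has to be computed from the difference between two samples. The sketch below assumes each sample is a dictionary of counters taken from the `Tcp:` rows of `/proc/net/snmp` (`OutSegs`, `RetransSegs`, and so on). The function names and the sample values are made up for illustration.

```python
def retransmit_percent(prev, curr):
    """Return retransmitted segments as a percentage of segments sent
    between two samples of the Tcp counters."""
    sent = curr["OutSegs"] - prev["OutSegs"]
    retrans = curr["RetransSegs"] - prev["RetransSegs"]
    if sent <= 0:
        return 0.0  # nothing sent in the interval (or the counters reset)
    return 100.0 * retrans / sent


def counter_deltas(prev, curr, names):
    """Return how much each named counter grew between two samples."""
    return {name: curr[name] - prev[name] for name in names}


# Two invented samples, taken some interval apart.
prev = {"OutSegs": 100000, "RetransSegs": 150,
        "ActiveOpens": 40, "PassiveOpens": 90, "OutRsts": 5}
curr = {"OutSegs": 120000, "RetransSegs": 750,
        "ActiveOpens": 45, "PassiveOpens": 110, "OutRsts": 9}

print(retransmit_percent(prev, curr))  # 3.0
print(counter_deltas(prev, curr, ["ActiveOpens", "PassiveOpens", "OutRsts"]))
```

What counts as "a lot" depends on the environment, so it is more useful to alert on a change from the usual baseline than on a fixed number.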