Network Monitoring

Introduction

As with other subsystems which contain instance data, you can monitor both summary (in brief and verbose modes) and detail data. Like disk data, the key brief mode values are bytes and packets (rather than iops). The actual data comes from /proc/net/dev.

The one key thing to keep in mind with network data is that not all networks are the same. Just like there are device mapper disks that shouldn't be included in the summary data the same is true for network devices. Those that are not included in the summary are the loopback, sit, bond and vmnet devices,

Since most lan networks run fairly cleanly and errors are rare, one is usually not interested in seeing long columns of zeros that never change and so by default brief mode does not include any error information. Adding --netopts e will add an additional column with a total error count. To see specific errors one would have to run in summary more and do identify the specific networks on which those errors were occurring you would have to run in detail mode.

What about IB over IP?
Good question. When using Infiniband networking you typically get an IB network device created. So does this mean IB traffice gets counted twice when you monitor both it and network data? As they say, it depends. Some Infiniband data will indeed go over the native IB interface and never show up as network data. This includes MPI traffic or lustre which uses the native IB transport. However, other uses of Infiniband may in fact be counted as network traffic. BUT this is actually a good thing because if you're a heavy user of IB/IP and want to be able to differentiate the native IB traffic from it, simply look at the network detail data and subtract any IB network numbers from the native values.

Tips and Tricks
Ever try looking for a needle in a haystack, in this case maybe it's network errors? --network E works just like its lowercase cousin except it tells collectl to only report intervals that have network errors in them. While this can be extremely boring in real-time mode, consider what happens during playback. During the course of a day you'll have 8640 samples but this switch will allow you to see the one that recorded the network error!

Filtering
If you'd like to limit the networks included in either the detail output or the summary totals, you can explicitly include or exclude them using --netfilt. The target of this switch is actually one or more perl expressions, but if you don't know perl all you really need to know is these are strings that are compared to each network name. If the first (or only) name is preceded with a ^, names that match the string(s) will be excluded.

This switch works exactly the same way as --dskfilt so for use case examples see Disk Monitoring.

Dynamic Network Discovery
Network devices typically don't change unless you install a new network card and reboot the machine, in which case collectl when the list of known networks is created it's simply correct. However, especially when running on a host that creates virtual machines, networks can come and go. The way collectl does this is to simply maintain a dynamic list of the active networks and make sure the statistics get associated with the correct network. As a way to preserve network details in plot format, collectl further retains a list of all the networks ever seen since system boot and uses that list to record the statistics, thereby keeping the columnar data consistent. In many cases where hypervisors reuse the virtual network names, the list of unique names remains relatively low.

However it was recently discovered that the OpenStack Quantum service for handling dynamic networks actually uses a new network name every time a VM is created. If a particular host has dozens of VMs and they come and go many times, the result can be a very long list of network names which not only show up in all the headers of all the files collectl generates, they will also show up as unique sets of data in the network detail file if one is generated. On one long running host, over 500 unique network names were generated!

The way collectl has chosen to deal with this is with --netopts o, which tells collectl to not preserve the ordered list of all the networks that have been seen and will in fact cause collectl to prune that list whenever a network is found to have been removed. While this solves the immediate need to keep only the active networks in the headers of the output files it creates a different problem with the detail files in that the data will no longer be consistent. In fact, it is probably best to not even try to generate detail files for dynamic network when using this switch. As of this writing I'm still trying to think of ways to improve the situation.
updated Sept 4, 2013