Sampling UDP Packets w/ TCPDump Bit-Masking

2021-08-31

This post assumes that you have a solid understanding of tcpdump bit-masking. If you need a refresher, check out my other post: tcpdump Bit-Masking (with Sticky-Notes!)

WHY SAMPLE PACKETS INSTEAD OF FULL PACKET CAPTURE?

If you are on a busy network and want to get a feel for what is running on it, dumping 100% of packets will create a cumbersome file very quickly. You may find yourself using tiny capture windows to keep the size down, which can miss any activity that happens outside that small time frame. Sampling lets you capture a representative portion of the traffic travelling on a network without capturing every single packet. The advantage is that you can capture for much longer, which increases your chances of finding activity that isn’t happening constantly. The trade-off is that you only have pieces of the picture and cannot reconstruct a full conversation between two hosts.

TCP BOTTLENECK

When it comes to sampling, TCP offers a convenient method in the form of its handshake. Before data can be sent or received, a handshake must establish the connection. The SYN-ACK appears once in every successful connection (retransmissions aside), which makes it a convenient packet to capture to get an idea of what types of activity are occurring on a network.
This can be accomplished with the following BPF filter: tcp[13] & 0x3F = 0x12
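
To make the mask concrete, here is a minimal Python sketch of the same test the filter applies to byte 13 of the TCP header. The names TCP_FLAG_MASK, SYN_ACK and is_syn_ack are hypothetical and exist only for illustration; the sketch works on a bare flags byte, not on real packets.

```python
# Byte 13 of the TCP header holds the flag bits.
# Low six bits: FIN=0x01, SYN=0x02, RST=0x04, PSH=0x08, ACK=0x10, URG=0x20.
TCP_FLAG_MASK = 0x3F  # keep the six classic flags, ignore the two high bits
SYN_ACK = 0x12        # SYN (0x02) + ACK (0x10)


def is_syn_ack(flags_byte: int) -> bool:
    """Mirror of the BPF test: tcp[13] & 0x3F = 0x12."""
    return (flags_byte & TCP_FLAG_MASK) == SYN_ACK


# SYN only, ACK only, SYN-ACK, and SYN-ACK with one of the high bits set.
for value in (0x02, 0x10, 0x12, 0x52):
    print(f"0x{value:02X} -> {is_syn_ack(value)}")
```

Masking with 0x3F instead of comparing the whole byte means a SYN-ACK still matches when one of the two high bits in that byte happens to be set.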

UDP SAMPLING CHALLENGE

UDP doesn’t require the handshake that TCP does in order to start sending data. There are no flags or fields in the UDP header to filter on that will ensure you get at least one packet from a given conversation. Luckily, we can lean on the “randomness” of a checksum to capture a “random” subsection of the traffic.

WHY NOT UDP CHECKSUM?

Initially I tried to use the UDP checksum to randomize packet capture, but my results came back skewed and uneven. On paper the UDP checksum covers a pseudo-header (source and destination IP addresses, protocol, and UDP length), the UDP header, and the payload, so it should vary from packet to packet. In practice it is less dependable: over IPv4 it is optional (a value of zero means the sender did not compute one), and with checksum offloading, packets captured on the sending host may carry a placeholder rather than the final checksum. The result was many packets sharing the same checksum values, which inadvertently excluded large amounts of traffic.

THE CHAOS OF THE IP CHECKSUM

The IP checksum is calculated from 11 fields in the IP header and, for the purpose of sampling, is “random” enough. It is a one’s-complement sum of the header’s 16-bit words, so a single-bit change in any field produces a different checksum, and the end-around carry can ripple that change into bits far from the one that flipped.

Example of two checksums calculated from two packets with only a single bit changed:

Packet 1: 0xF3CC (1111 0011 1100 1100)
Packet 2: 0x33CD (0011 0011 1100 1101)
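
For readers who want to reproduce this kind of comparison, here is a minimal Python sketch of the IPv4 header checksum (the one’s-complement sum of 16-bit words). The header bytes in BASE are a made-up example, not taken from my capture, and the helper names ip_checksum and flip_bit are hypothetical.

```python
import struct


def ip_checksum(header: bytes) -> int:
    """Internet checksum of an IPv4 header whose checksum field (bytes 10-11) is zeroed."""
    if len(header) % 2:
        header += b"\x00"  # pad to a whole number of 16-bit words
    total = sum(struct.unpack(f"!{len(header) // 2}H", header))
    while total >> 16:  # fold carries back in (end-around carry)
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def flip_bit(data: bytes, byte_index: int, bit: int) -> bytes:
    """Return a copy of data with one bit inverted."""
    out = bytearray(data)
    out[byte_index] ^= 1 << bit
    return bytes(out)


# Hypothetical 20-byte header: UDP from 192.168.0.1 to 192.168.0.199,
# with the checksum field set to zero.
BASE = bytes.fromhex("45000073" "00004000" "40110000" "c0a80001" "c0a800c7")

original = ip_checksum(BASE)
changed = ip_checksum(flip_bit(BASE, 5, 0))  # lowest bit of the identification field

print(f"original: 0x{original:04X}  ip[11] = 0x{original & 0xFF:02X}")
print(f"changed:  0x{changed:04X}  ip[11] = 0x{changed & 0xFF:02X}")
```

With this sample header the two checksums come out as 0xB861 and 0xB860, so the low bit of ip[11] differs between two packets that are otherwise identical.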

LEVERAGING THE IP CHECKSUM

A bit mask can be used to isolate certain bits from the last byte of the IP checksum (ip[11]) so that only packets whose checksums end with certain bits are captured. I have provided a table that illustrates how the bits and bytes are related and how different filters capture different percentages of the traffic.

IP[11] & 0X01 = 0X01

If we mask everything but the last bit of the checksum and require it to be 1 (ip[11] & 0x01 = 0x01), we capture any checksum that ends in hex 1,3,5,7,9,B,D,F while discarding anything that ends in hex 0,2,4,6,8,A,C,E. Because the IP checksum is fairly random, this works well for sampling roughly half of the traffic.

IP[11] & 0X03 = 0X03

Masking everything but the last two bits and requiring both to be 1 (ip[11] & 0x03 = 0x03) narrows our sampling even further. Now we capture IP checksums ending with the hex characters 3,7,B,F and discard the ones that end with hex 0,1,2,4,5,6,8,9,A,C,D,E. This results in capturing roughly a quarter of the traffic.

IP[11] & 0X07 = 0X07

Preserving the last three bits and requiring all of them to be 1 (ip[11] & 0x07 = 0x07) will capture only checksums that end with the hex characters 7 and F. The checksums that end in 0,1,2,3,4,5,6,8,9,A,B,C,D,E are discarded, resulting in roughly 12.5% of traffic captured.
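
The three filters above follow one pattern, which this short Python sketch enumerates. The function names matching_nibbles and sampling_filter are hypothetical helpers for illustration; the sketch only builds the expression string and does not run tcpdump.

```python
def matching_nibbles(mask: int) -> list[str]:
    """Hex digits (last nibble of the checksum) that satisfy nibble & mask == mask.

    Only meaningful for masks that fit in the low nibble (0x00 to 0x0F).
    """
    return [f"{n:X}" for n in range(16) if (n & mask) == mask]


def sampling_filter(bits: int) -> str:
    """BPF expression requiring the lowest `bits` bits of ip[11] to all be 1."""
    if not 1 <= bits <= 8:
        raise ValueError("ip[11] is a single byte, so bits must be between 1 and 8")
    mask = (1 << bits) - 1
    return f"ip[11] & 0x{mask:02x} = 0x{mask:02x}"


for bits in (1, 2, 3):
    mask = (1 << bits) - 1
    kept = matching_nibbles(mask)
    print(f"{sampling_filter(bits)}: keeps {','.join(kept)} ({len(kept) / 16:.1%})")
```

Each extra bit in the mask halves the expected sample, so n bits keep roughly 1 in 2^n packets, assuming the low bits of the checksum are evenly distributed.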

TESTING THE THEORY

CHECKSUM VALUE DISTRIBUTION

Mathematically, I expected the distribution of the last nibble of the IP checksum to be even across a large number of packets. Put another way, I would expect the number of packets with an IP checksum ending in “A” to be roughly the same as the number ending in “F” or any other hex value. To test this, I isolated the IP checksums from a capture of 16.6 million packets and pulled the frequency of the last nibble.

Success! The last nibble of the IP checksum was distributed fairly evenly across the traffic.
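
A frequency count like this can be sketched in a few lines of Python. This is not the exact tooling I used; it assumes the checksums have already been exported as hex strings, one per packet (for example with tshark -r capture.pcap -T fields -e ip.checksum, where capture.pcap is a placeholder file name). The function name last_nibble_histogram and the sample values are hypothetical.

```python
from collections import Counter


def last_nibble_histogram(checksums):
    """Count how often each hex digit appears as the last nibble of a checksum.

    `checksums` is an iterable of hex strings such as "0xb861" or "B861".
    Blank lines are skipped.
    """
    counts = Counter(int(text, 16) & 0x0F for text in checksums if text.strip())
    return {f"{n:X}": counts.get(n, 0) for n in range(16)}


# Tiny made-up sample; a real run would read millions of lines from a file.
sample = ["0xb861", "0xb860", "0xf3cc", "0x33cd", ""]
histogram = last_nibble_histogram(sample)

for digit, count in histogram.items():
    print(f"{digit}: {count}")
```

On a large capture, every digit should land near one sixteenth of the total if the last nibble is evenly distributed.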

FILTER IMPACTS ON TRAFFIC PATTERNS

The next thing to test was whether the traffic would follow similar patterns regardless of the filter I put in place, so I pulled the top ten port numbers across those same 16.6 million packets using each filter and got the following results:

What we see here is that IP checksum filters give us representative samples of network traffic. This method has proved valuable for surveying a busy network to understand what services are active, because it allows you to pull a packet capture over a longer period of time without the file becoming massive and unwieldy.
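
If you want to repeat the top-ports comparison on your own capture, the idea can be sketched in Python as follows. It assumes each packet has already been reduced to a (checksum, port) pair by some other tool; the function name top_ports and the sample data are hypothetical and are not my results.

```python
from collections import Counter


def top_ports(packets, mask=0x00, n=10):
    """Return the n most common ports among packets that pass the checksum filter.

    `packets` is an iterable of (ip_checksum, port) integer pairs.
    A packet passes when ip[11] & mask == mask; mask 0x00 keeps everything.
    """
    kept = [port for checksum, port in packets if ((checksum & 0xFF) & mask) == mask]
    return Counter(kept).most_common(n)


# Made-up sample data for illustration only.
packets = [(0xB861, 53), (0xB860, 53), (0xF3CC, 123), (0x33CD, 53), (0x1203, 123)]

for mask in (0x00, 0x01, 0x03):
    print(f"mask 0x{mask:02x}: {top_ports(packets, mask)}")
```

On a real capture, the ranking of ports should stay similar as the mask gets stricter, even though the absolute counts shrink.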