Internet Quality of service (QoS)

This text aims to introduce the subject of Quality of Service (QoS) for the Internet, with some focus on Linux QoS tools. The authoritative documentation is still the good old LARTC HOWTO, but it is quite outdated. This article from Comparitech offers more current directions. I also mention some novel features not yet scribbled on LARTC.

Is it possible to do QoS on TCP/IP?

The short answer is "no". The only thing you can do regarding the Internet is to control your own packet transmission rate. You can't even control the receiving rate. You may drop incoming packets, but by then it is too late; they have already eaten your bandwidth.

Actually, it is very, very easy to bog down someone's Internet pipe: just ask half a dozen friends to ping-flood the victim (a simple ping -f -s 4096 victim_address does wonders!).

Moreover, the TCP protocol has the natural trait of using all the available bandwidth. In theory, several concurrent TCP connections will share the available band. In practice, it is more accurate to say they fight for it. In particular, new connections struggle to make way. Everybody who has tried to browse a site while running big downloads knows the issue.

(Message from 2016 to a 2004 article: nowadays even consumer-grade routers do some sort of QoS, and consumer pipes are much faster than they were in 2004, so the problem is not as visible as it was back then. But a slow, shared 3G pipe may still act up.)

The application could control its own data rate, but only a small minority actually does this. Since TCP/IP offers no direct mechanism to discover or even estimate the pipe speed, it is quite difficult for an app to exert self-control and use, say, just one third of the available band.
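
A few tools do offer self-limiting knobs, and they are worth using when available. Two examples (real wget and rsync options; rates, hosts and file names are made up):

# Ask wget to cap itself at 100 kB/s
wget --limit-rate=100k https://example.com/big.iso

# Ask rsync to cap itself at 500 kB/s
rsync --bwlimit=500 big.iso backup-host:/srv/backup/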

In order to implement true QoS for the Internet, all intermediate routers would need to support it, or at the very least the routers in front of bottlenecks, like an ISP's international pipe. If you create a connection between Brazil and New Zealand and want a guaranteed minimum bandwidth of 100kbps, every router along the way would have to guarantee those 100kbps. The guarantee goes both ways: we get the necessary bandwidth, and our connection does not try to use more.

It would be perfectly possible to implement this kind of QoS on the current Internet. The necessary protocols already exist (e.g. RSVP) and router manufacturers do offer models that support them. The thing stumbles on a simple question: who's gonna pay the bill? Everybody wants QoS but nobody wants to pay extra. OTOH if it were free, it would be abused.

So, what we can do is to build QoS strategies at our endpoint, based solely on shaping our own transmission rates.

But my Internet works well, why bother?

Given the characteristics and limitations of TCP/IP, I am genuinely surprised it works well most of the time. Even services like voice, video and conferencing are used successfully by billions of people around the globe. How is it possible?

This is due, in no small part, to TCP's congestion control. It is the "antidote" against the tendency to use all available bandwidth. At the first sign of packet loss (or of an ECN flag, in recent systems), TCP sharply cuts its transmission rate.

Besides avoiding the general seizure of the whole Internet, this TCP feature opens the door for some QoS tricks, as we will see.

Another lesser-known feature of TCP/IP is the TOS field (Type Of Service). It is a small part of the IP header that hints at how the packet should be prioritized by routers. (Two bits of the original 8-bit TOS field are now employed by ECN.)

Even though no intermediate router is bound to honor these bits (they can be abused, after all), most apps, routers and operating systems respect them and try to prioritize traffic accordingly. Again, this applies to upstream traffic, whose queue our own router controls.
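
For illustration (generic Linux examples, not tied to any particular setup): you can check whether your TCP stack negotiates ECN, and you can stamp the TOS/DSCP field of outgoing packets with iptables, e.g. to prioritize interactive SSH:

# 0 = off, 1 = request ECN, 2 = use ECN only when asked (common default)
sysctl net.ipv4.tcp_ecn

# Mark outgoing SSH packets as "Expedited Forwarding";
# adapt ports and DSCP classes to your own services
iptables -t mangle -A OUTPUT -p tcp --dport 22 \
	-j DSCP --set-dscp-class EF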

But what is QoS, after all?

An informal definition would be something like: several services share the Internet pipe, each with different demands, and all of them manage to carry out their duties. Fast downloads, VoIP without cuts, SSH and VNC sessions with very low latency, everything together.

Some objective parameters of QoS:

Bandwidth: many services just need a minimum band to function, e.g. VoIP or videoconferencing. Other services, e.g. bulk downloads, may use whatever band is left.

Latency: the time a packet takes to reach its destination. High latency hinders online games, VNC and SSH sessions, and may be bad for voice and live video. If it's really high, opening a simple Web page takes a long time, and this is very annoying.

Very often, a latency issue is perceived as "lack of bandwidth". The first ADSL links had just twice the dial-up bandwidth, but they felt ten times as fast, simply because their latency was much smaller (an RTT of 30ms for ADSL versus 140ms for dial-up).

Jitter: the variation of latency. It is a problem for live services like VoIP and video conferencing, but it can hurt every TCP connection, since TCP assumes that link latency is stable. A packet that takes too long to arrive is deemed lost, and TCP overreacts to packet loss. (Latency and jitter are easy to measure; see the example right after this list.)

Most VoIP and video applications use a "jitter buffer" as a cushion against jitter, trading an uncertain latency (that could cause drop-outs) for a higher constant latency. This must be weighed very well, because long delays tend to confuse users.

Fairness: all connections and services get their fair share of the band. Services of the same class and priority should get equal parts of the reserved resources.
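
Latency and jitter are easy to eyeball with plain ping, as promised above (Linux iputils shown; the "mdev" figure in the summary is a rough stand-in for jitter):

# 20 probes; the final "rtt min/avg/max/mdev" line gives the
# average latency (avg) and the latency variation (mdev)
ping -c 20 example.com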

How to ruin the day of your fellow Internet users

Without any Internet QoS, spoiling the Internet is just a matter of starting a handful of big-file downloads. Linux distro ISOs are a good choice. Here is what happens next:

a) The downloads will speed up until they take all the available bandwidth.

b) Probably your Internet link is the bottleneck between your computer and the server; therefore, the ISP's router nearest to you begins to queue packets. For ADSL links, this queue can be very long, up to 5 seconds worth of traffic.

c) If any other user tries to open a new connection, the SYN+ACK response packet goes to the end of that queue. The connection will take 10 seconds to open instead of the usual 0.03.

d) If any user tries to use VoIP, he will endure the 5-second queue, and the other party's voice will take that long to arrive. A lot worse than an international phone call.

e) Once the queue is full, the ISP router finally gives up and starts to drop packets. Your rogue downloads will throttle down in response to this. For just a moment, the Internet looks ok again... and the cycle repeats. The ISP router queue size varies a lot, and the latency goes along, increasing the jitter.

Since you fired many concurrent downloads, they will fight each other for bandwidth. The excess traffic of one download will cause packet drops in another. The queue size becomes even more unstable, spoiling your coworkers' Internet for good.
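
To put a number on that "5 seconds worth of traffic": at a 2Mbps downstream rate, 5 seconds of queue amounts to

    2,000,000 bit/s x 5 s / 8 = 1.25 MB, roughly 830 full-size (1500-byte) packets

and every new packet, SYN+ACK included, has to wait behind all of them.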

Until now, only the download was spoiled. The upload was completely free. In order to "solve" this, start up a couple of torrent downloads of very popular movies and/or pr0n. Leave all upload/download controls disabled. Then,

a) The torrent starts to send data upstream until it fills the available upload bandwidth. This time, your own local router starts to queue packets.

b) Now the other users endure two queues: one upstream (at your router), another downstream (at the ISP's router). The latency may double; let's not even talk about jitter.

Dilbert's Wally could not have done better. And everybody blames the PHB because "he doesn't want to spend on a better Internet".

How to defend your local network against rogue users

Now it's Black's turn. The network admin or the router manufacturer strikes back.

To be honest, there is no surefire defense against an inside user who is dead set on wrecking your Internet access. A simple and efficient measure is to monitor traffic with a tool like ntop and allow everyone to query the records. Peer pressure will do the rest.

But, as often as not, users misuse the net without even knowing it. Big downloads carried out automatically by computers and smartphones are beyond the user's control as well. So, we must be proactive.

Let's imagine a very simple network with a NAT router:

local network  +----------+
---------------|eth0      |
               |  router  |          +------------+
               |      ppp0|----------| ISP router |----> Internet
               +----------+   ADSL   +------------+

As we discussed before, the major goal is to avoid long queues in the routers. Upload-wise, the queue is in our own router, since the ADSL link is the bottleneck. Download-wise, the queue is in our ISP's router, which we cannot control directly, and it is our biggest problem.

To avoid the latter queue, we can create an artificial bottleneck at eth0's output: restrict the bandwidth to a value smaller than the ADSL rate, and discard any excess packets instead of queueing them.

The (artificially) lost packets will force the TCP connections to enter congestion control mode and slow down. There are no hard guarantees but, with a bit of luck, the system (routers + connections) will settle at a bandwidth level that hardly ever loses a packet at eth0, and never queues up the ISP router.

Choosing eth0's bandwidth takes a bit of experimentation. The smaller the band, the smaller the chance of packets queueing at the ISP router, and the faster the link recovers from congestion. It would be great to have a 100Mbps link and throttle it to 50Mbps...

This is an unavoidable characteristic of QoS for TCP/IP: in order to improve one metric, we need to give something in exchange. Generally, we trade bandwidth for latency.

In the case of a 2Mbps ADSL link, a throttle of 1.6Mbps yields good results. We are not really losing 20% of the link; it is closer to 10%, because ADSL has notoriously high overhead (ATM, PPPoE) and the usable rate is well below the nominal 2Mbps. The mileage will vary for other link types.
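
The quick arithmetic, for the curious: each 53-byte ATM cell carries only 48 bytes of payload (a cell tax of about 9.4%), and PPPoE adds its own headers on top. A nominal 2Mbps link therefore carries roughly 1.8Mbps of useful traffic, so a 1.6Mbps throttle gives up only about 10% of the real capacity.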

Now, let's implement this throttle using a Linux script.

DOWN=1600
tc qdisc del dev eth0 root    2> /dev/null > /dev/null

# Downstream discipline at eth0. Not all eth0 traffic should
# be disciplined; only packets from the Internet. 

tc qdisc add dev eth0 root handle 1: htb r2q 100

# Class with a throttle slightly smaller than the ADSL rate
tc class add dev eth0 parent 1: classid 1:20 htb \
	rate ${DOWN}kbit ceil ${DOWN}kbit

# SFQ for fairness between connections
tc qdisc add dev eth0 parent 1:20 handle 20: sfq perturb 11

# Now, we label or mark the packets that should be controlled
# by the above rules. Criteria: mark packets that come from
# Internet-facing interfaces.

iptables -t mangle -F PREROUTING
ip6tables -t mangle -F PREROUTING
iptables  -A PREROUTING -t mangle -i ppp0 -j MARK --set-mark 20
ip6tables -A PREROUTING -t mangle -i sixxs -j MARK --set-mark 20

# Put packets marked with 20 in the SFQ queue (1:20)

tc filter add dev eth0 protocol ip parent 1:0 \
	prio 1 handle 20 fw flowid 1:20
tc filter add dev eth0 protocol ipv6 parent 1:0 \
	prio 2 handle 20 fw flowid 1:20

# Query statistics:
# tc -s -d qdisc show dev eth0
# tc -s -d class show dev eth0
# tc -stats qdisc show dev eth0

This rather long script has the sole purpose of throttling the traffic from the Internet to the local network. It does not control the "ingress" traffic, that is, the download traffic whose final destination is the router itself. Note that the throttling happens at eth0's egress (output), not at ppp0's ingress (input). Remember the mantra: we can't control what we receive, we only control what we transmit.

To be fair, the script does one more thing: it enforces fairness between concurrent connections. This is possible because we are in control of eth0's queue. Let's revisit two tc commands that are the heart of the whole script:

tc class add dev eth0 parent 1: classid 1:20 htb \
	rate ${DOWN}kbit ceil ${DOWN}kbit
tc qdisc add dev eth0 parent 1:20 handle 20: sfq perturb 11

The first command creates a traffic class. That's where we throttle the bandwidth. The second command creates a queue discipline. The adopted discipline is SFQ (stochastic fairness queue). It creates a sub-queue per connection (not exactly, but don't tell anyone!) and allows each sub-queue to send one packet per turn.

In a pristine Linux computer, without any explicit QoS configuration, the default queue discipline is pfifo_fast. It takes the TOS bits into account, so QoS is always there to some degree.
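
You can check which discipline is currently active (an illustrative command; many recent distros ship fq_codel as the default instead):

# Shows the qdiscs currently attached to eth0
tc qdisc show dev eth0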

The interface-class-discipline tree is defined top-down, but an actual network packet flows bottom-up. Once the routing decision is made, the packet is added to some queue discipline, and there it stays, waiting. There may be several queues; the "tc filter" command steers each packet to the desired queue.

Once the network interface is free to send a packet, it requests one from the "root" qdisc, which in turn asks the class we have created. The class only releases packets in accordance with its bandwidth rate. When it is ready, it requests a packet from its queue (or from a subclass, if there are subclasses). The first packet in the queue is delivered, and it bubbles up to the interface.

Well, we have handled the download problem. Now we need to handle the upload. The script below implements upload control on the ppp0 interface:

UP=320
tc qdisc del dev ppp0 root    2> /dev/null > /dev/null
tc qdisc add dev ppp0 root handle 1: htb default 1 r2q 20
tc class add dev ppp0 parent 1: classid 1:1 htb rate ${UP}kbit \
	ceil ${UP}kbit
tc qdisc add dev ppp0 parent 1:1 handle 10: sfq perturb 10

As we did for eth0, we again defined a bandwidth-limiting class and a queue discipline that enforces fairness.

One could argue that we don't need to limit the upstream bandwidth, since it is naturally throttled by the ADSL link. The problem is, our Linux network interface has a queue of its own. It can be very long (the traditional default is 1000 packets) and, unless tamed, it queues packets just like that ISP router does. Controlling the bandwidth using classes guarantees that each packet is transmitted as soon as the class releases it, so this queue stays empty.
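
That queue can be inspected, and even shortened, with iproute2 (illustrative commands; shrinking txqueuelen was a classic ADSL tweak in the LARTC days):

# The "qlen" figure is the transmit queue length, in packets
ip link show dev ppp0

# Shorten it, e.g. to 10 packets
ip link set dev ppp0 txqueuelen 10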

By keeping the routers free of long queues, the latency stays low, the jitter very low, and new connections are established very fast.

The only missing piece is to reserve bandwidth for services that need guaranteed rates, like voice over IP. We do this by adding sub-classes, as sketched right below. (A fuller treatment may land in this very article someday. But don't hold your breath.)
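
A minimal sketch of what such sub-classes could look like on the upload side (an untested illustration, not a battle-proven recipe; the SIP port 5060 match stands in for whatever identifies your VoIP traffic, and real RTP streams would need extra filters):

UP=320
tc qdisc del dev ppp0 root 2> /dev/null > /dev/null
tc qdisc add dev ppp0 root handle 1: htb default 12 r2q 20

# Parent class: the total upstream throttle, as before
tc class add dev ppp0 parent 1: classid 1:1 htb \
	rate ${UP}kbit ceil ${UP}kbit

# Sub-class 1:11: 100kbit guaranteed for VoIP, may borrow up to the ceiling
tc class add dev ppp0 parent 1:1 classid 1:11 htb \
	rate 100kbit ceil ${UP}kbit prio 0
# Sub-class 1:12: everything else shares the remainder
tc class add dev ppp0 parent 1:1 classid 1:12 htb \
	rate 220kbit ceil ${UP}kbit prio 1

tc qdisc add dev ppp0 parent 1:11 handle 11: sfq perturb 10
tc qdisc add dev ppp0 parent 1:12 handle 12: sfq perturb 10

# Steer SIP signalling (UDP port 5060) into the reserved class
tc filter add dev ppp0 parent 1: protocol ip prio 1 u32 \
	match ip protocol 17 0xff match ip dport 5060 0xffff flowid 1:11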

Sometimes our Linux router runs network services as well (FTP, Web), and our previous scripts can't control the download bandwidth whose final destination is the router itself. We need to add one more control, at the "ingress" hook:

DOWN_PRE=1750
tc qdisc del dev ppp0 ingress 2> /dev/null > /dev/null
tc qdisc add dev ppp0 handle ffff: ingress
tc filter add dev ppp0 parent ffff: protocol ip prio 50 \
	u32 match ip src \
  0.0.0.0/0 police rate ${DOWN_PRE}kbit burst 20k drop flowid :1

It must be said that this ingress control is a gimmick: just rate policing, with no classes and no fairness. (Read on for a better way.) This pre-filter must let a slightly bigger bandwidth pass (in the example, 1750kbps vs. 1600kbps at eth0). Otherwise, the throttle at eth0 is no longer the major bottleneck of the whole network, and it ceases to function.

There is a better trick to impose QoS on ingress traffic: create a fake network interface, where packets "ingress" and "egress" before they are routed. This way, we have an egress point to hook our classes and queues.

At least two Linux modules do that: IMQ and IFB. IMQ is older and easy to understand, and lots of people have used it, but it never made it into the mainline kernel. It needs a patched version of iptables, so it is a bit messy to install. On the other hand, IFB is more recent, made it into the mainline kernel, and can be found in your favorite distro.

I must say that, even though IFB does work, it is still a higher-order gimmick, and it is a pain to tune and test the scripts until everything works. Try to avoid running network services on a Linux router; use it as a pure router. QoS is difficult enough without the additional hassle of IFB.

Complete example: QoS script for router

This is the original article's example script. Note that in some recent kernels each "tc filter" must have a distinct priority, hence the explicit "prio 1", "prio 2" values on the filters.

#!/bin/sh

UP=320
DOWN_PRE=1750
DOWN=1600

# Clear old rules

tc qdisc del dev ppp0 root    2> /dev/null > /dev/null
tc qdisc del dev eth0 root    2> /dev/null > /dev/null
tc qdisc del dev ppp0 ingress 2> /dev/null > /dev/null

[ "$1" = "off" ] && exit 0

# Top discipline for Internet-facing network interface

# r2q is chosen so that rate/r2q is just bigger than the MTU.
# In the example: 320kbit/s = 40000 bytes/s; 40000/20 = 2000 > 1500, ok

tc qdisc add dev ppp0 root handle 1: htb default 1 r2q 20

# Class 1:1, throttles band to the upstream link capacity
tc class add dev ppp0 parent 1: classid 1:1 htb \
	rate ${UP}kbit ceil ${UP}kbit

# For the class 1:1, the traffic discipline is SFQ
# for a fair distribution of band among connections
tc qdisc add dev ppp0 parent 1:1 handle 10: sfq perturb 10

# Rough throttling on downstream; the main throttling
# is attached to eth0. This one limits download traffic
# destined to the router itself (not routed).

tc qdisc add dev ppp0 handle ffff: ingress
tc filter add dev ppp0 parent ffff: protocol \
  ip prio 50 u32 match ip src \
  0.0.0.0/0 police rate ${DOWN_PRE}kbit burst 20k drop flowid :1

# Downstream discipline, attached to eth0.
# Not all eth0 traffic should be disciplined, only the
# packets from Internet. Local traffic flows undisturbed.

tc qdisc add dev eth0 root handle 1: htb # r2q 100

# Class that throttles according to the ADSL downstream rate.
tc class add dev eth0 parent 1: classid 1:20 htb \
	rate ${DOWN}kbit ceil ${DOWN}kbit

# Again we use SFQ for improved queue management
tc qdisc add dev eth0 parent 1:20 handle 20: sfq perturb 11

# Now, we mark the traffic that should be under the above rule.
# Criteria: mark the packets coming from Internet-facing interfaces.

iptables -t mangle -F PREROUTING
ip6tables -t mangle -F PREROUTING
iptables  -A PREROUTING -t mangle -i ppp0 -j MARK --set-mark 20
ip6tables -A PREROUTING -t mangle -i sixxs -j MARK --set-mark 20
tc filter add dev eth0 protocol ip parent 1:0 prio 1 \
	handle 20 fw flowid 1:20
tc filter add dev eth0 protocol ipv6 parent 1:0 prio 2 \
	handle 20 fw flowid 1:20

# Unmarked traffic does not show up in any class statistics
# except "root".

# tc -s -d qdisc show dev ppp0
# tc -s -d class show dev ppp0

NAS server script

This example script was not made for a router; it was put together for a NAS server. The goal is to keep the NAS server from eating all our bandwidth while it syncs with cloud storage, while local network communication flows with no restrictions.

This script makes use of some newer techniques that weren't available before: IFB for incoming traffic control, full IPv6 support in the "tc" tool, and packet marking with CONNMARK, so that all packets belonging to a connection keep the initial mark, even in the face of IFB.

IF=eno1
UP=2000
DOWN=15000
# FIXME detect ISP-supplied IPv6 prefix automatically
IPV6_NET="2804:111:f230:a683::/64"

modprobe ifb numifbs=1
ip link set dev ifb0 up

iptables -t mangle -F
ip6tables -t mangle -F

# Mark packets according to the connection they belong to
# (otherwise, only the connection-opening packet is marked)

iptables -A POSTROUTING -t mangle -j CONNMARK --restore-mark
ip6tables -A POSTROUTING -t mangle -j CONNMARK --restore-mark
iptables -A PREROUTING -t mangle -j CONNMARK --restore-mark
ip6tables -A PREROUTING -t mangle -j CONNMARK --restore-mark

# Packets already labeled leave the chain as fast as possible

ip6tables -A POSTROUTING -o $IF -t mangle -m mark --mark 5 -j ACCEPT
ip6tables -A POSTROUTING -o $IF -t mangle -m mark --mark 1 -j ACCEPT
ip6tables -A PREROUTING -i $IF -t mangle -m mark --mark 5 -j ACCEPT
ip6tables -A PREROUTING -i $IF -t mangle -m mark --mark 1 -j ACCEPT
iptables -A POSTROUTING -o $IF -t mangle -m mark --mark 5 -j ACCEPT
iptables -A POSTROUTING -o $IF -t mangle -m mark --mark 1 -j ACCEPT
iptables -A PREROUTING -i $IF -t mangle -m mark --mark 5 -j ACCEPT
iptables -A PREROUTING -i $IF -t mangle -m mark --mark 1 -j ACCEPT

# Local network packets: mark "1"
# Packets from/to Internet: mark "5"

ip6tables -A PREROUTING -t mangle -i $IF -s $IPV6_NET \
	-j MARK --set-mark 1
ip6tables -A PREROUTING -t mangle -i $IF -s fe80::/64 \
	-j MARK --set-mark 1
ip6tables -A PREROUTING -t mangle -i $IF \
	-m mark ! --mark 1 -j MARK --set-mark 5

ip6tables -A POSTROUTING -t mangle -o $IF -d $IPV6_NET \
	-j MARK --set-mark 1
ip6tables -A POSTROUTING -t mangle -o $IF -d fe80::/64 \
	-j MARK --set-mark 1
ip6tables -A POSTROUTING -t mangle -o $IF \
	-m mark ! --mark 1 -j MARK --set-mark 5

# Same marks for ingress IPv4 packets

iptables -A PREROUTING -t mangle -i $IF -d 192.168.0.0/255.255.0.0 \
	-j MARK --set-mark 1
iptables -A PREROUTING -t mangle -i $IF -s 192.168.0.0/255.255.0.0 \
	-j MARK --set-mark 1
iptables -A PREROUTING -t mangle -i $IF -m mark ! --mark 1 \
	-j MARK --set-mark 5

# Same marks for egress IPv4 packets

iptables -A POSTROUTING -t mangle -o $IF -d 192.168.0.0/255.255.0.0 \
	-j MARK --set-mark 1
iptables -A POSTROUTING -t mangle -o $IF -s 192.168.0.0/255.255.0.0 \
	-j MARK --set-mark 1
iptables -A POSTROUTING -t mangle -o $IF -m mark ! --mark 1 \
	-j MARK --set-mark 5

# Saves the mark for the respective connection

iptables -A POSTROUTING -t mangle -j CONNMARK --save-mark
iptables -A PREROUTING -t mangle -j CONNMARK --save-mark
ip6tables -A POSTROUTING -t mangle -j CONNMARK --save-mark
ip6tables -A PREROUTING -t mangle -j CONNMARK --save-mark

# Next commands limit the upload rate. Not much different
# from the other script. Packets labeled "5" are affected.

tc qdisc del dev $IF root    2> /dev/null > /dev/null
tc qdisc del dev $IF ingress 2> /dev/null > /dev/null
tc qdisc del dev ifb0 root    2> /dev/null > /dev/null
tc qdisc del dev ifb0 ingress 2> /dev/null > /dev/null

# r2q chosen by trial and error, until dmesg shows no warnings;
# it is in the same order of magnitude as the ratio between
# the bands allocated to classes 1:20 and 1:30

tc qdisc add dev $IF root handle 1: htb default 30 r2q 200

tc class add dev $IF parent 1: classid 1:20 htb \
    rate ${UP}kbit ceil ${UP}kbit
tc qdisc add dev $IF parent 1:20 handle 20: sfq perturb 10

tc class add dev $IF parent 1: classid 1:30 htb \
    rate 90mbit ceil 90mbit
tc qdisc add dev $IF parent 1:30 handle 30: sfq perturb 10

tc filter add dev $IF parent 1: protocol ip prio 1 \
	handle 5 fw flowid 1:20
tc filter add dev $IF parent 1: protocol ipv6 prio 2 \
	handle 5 fw flowid 1:20

# This time we will use the "ingress" hook to redirect
# packets to the fake interface ifb0 instead of controlling
# rate directly. Note the "connmark", "mirred" and "egress"
# keywords.

tc qdisc add dev $IF handle ffff: ingress
tc filter add dev $IF parent ffff: protocol ip \
	prio 5 u32 match u32 0 0 \
	action connmark action mirred egress \
	redirect dev ifb0
tc filter add dev $IF parent ffff: protocol ipv6 \
	prio 6 u32 match u32 0 0 \
	action connmark action mirred egress \
	redirect dev ifb0

# Traffic discipline attached to the fake interface ifb0.
# When we limit the "transmission" rate of this interface,
# we are actually limiting the ingress rate of $IF.
# Again, the packets labeled "5" are directed to the
# traffic shaping.
    
tc qdisc add dev ifb0 root handle 2: htb default 30 r2q 200
    
tc class add dev ifb0 parent 2: classid 2:20 htb \
    rate ${DOWN}kbit ceil ${DOWN}kbit
tc class add dev ifb0 parent 2: classid 2:30 htb \
	rate 90mbit

tc qdisc add dev ifb0 parent 2:20 handle 20: sfq perturb 10
tc qdisc add dev ifb0 parent 2:30 handle 30: sfq perturb 10

tc filter add dev ifb0 parent 2: protocol ip prio 10 \
	handle 5 fw flowid 2:20
tc filter add dev ifb0 parent 2: protocol ipv6 prio 11 \
	handle 5 fw flowid 2:20
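
To verify that packets actually land in the intended classes, query the class statistics (standard tc queries; Internet-bound traffic should accumulate in 1:20 on $IF and 2:20 on ifb0):

# tc -s class show dev $IF
# tc -s class show dev ifb0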