$ /sbin/tc -V tc utility, iproute2-ss040831
This document describes the usage of PSPacer, which achieves precise
network bandwidth control and smoothing of bursty traffic without
any special hardware.
PSPacer is implemented based on the iproute2 which is the traffic
control framework for the Linux platform.
This framework provides three basic elements: queueing discipline (qdisc),
class, and filter [lartc].
PSPacer consists of two parts: a loadable kernel module and a user-level
library. The former (i.e. sch_psp) is a classful qdisc,
[A classful qdisc contains multiple classes.]
and the latter (i.e. q_psp.so) communicates with this module for control
and monitoring it via the tc (8) command.
PSPacer uses the IEEE 802.3x PAUSE frame as a gap between packets. Therefore, you can not use the PAUSE frame to stop transmission from the switch/router to which the system is connected. It is recommended to disable IEEE802.3x flow control function of the network switch (to which a PC with PSPacer is connected) in order to avoid unexpected behavior. For your information, see our PFLDnet2005 paper [PFLDnet05]. and the Web page http://www.gridmpi.org/gridtcp.jsp.
We have tested this package using the FedoraCore 3 on IA32 machines with the Intel PRO/1000, and the FedoraCore 5 on x86_64 machines with Broadcom BCM5704. The complete list of tested environments is in the Appendix B.
|
Note
|
PSPacer version 2.x does not support the Linux kernel 2.4! |
The following source codes are required for the compilation:
Linux kernel 2.6.x
iproute2
The kernel re-compilation is not required, however the compilation of the sch_psp module requires the kernel source code. If you use the RedHat Linux based distribution, install the kernel-devel (or kernel-smp-devel) RPM. For your information, if you have built the custom kernel, make sure that the CONFIG_NET_SCHED and CONFIG_NET_CLS options have been selected.
Since the iproute2 package is tightly coupled with the kernel, you must use a proper combination. Make sure the current working version of iproute2 is available in your Linux box. To check the version, issue the following command:
$ /sbin/tc -V tc utility, iproute2-ss040831
Extract the iproute2 archive file to /opt/iproute2 directory, or set the iproute2 source directory path to the IPROUTE_SRC variable in the tc/Makefile file. If you do not have the source code, get an iproute2 archive from the Web page [iproute2].
$ cd /opt $ wget http://developer.osdl.org/dev/iproute2/download/iproute2-2.6.9-ss040831.tar.gz $ tar zxvf iproute2-2.6.9-ss040831.tar.gz $ ln -s iproute2-2.6.9 iproute2
|
Note
|
In the FedoraCore 3 after “yum update”, the kernel and the iproute2 will have mismatching versions. You have to choose either downgrading the kernel to version 2.6.9 or upgrading the iproute2 to version ss050330 manually. |
To make and install this software, issue the following commands:
$ tar zxvf pspacer-<version>.tar.gz $ cd pspacer-<version> $ make $ su # make install
“make install” will install the following files. Please check these files.
/lib/modules/`uname -r`/kernel/net/sched/sch_psp.ko /usr/lib/tc/q_psp.so /usr/share/man/man8/tc-psp.8 /usr/share/man/ja/man8/tc-psp.8 <only if your locale is Japanese>
Alternatively you can install from a RPM package. To build the RPM package from the source tarball, issue the following command:
# rpmbuild -tb --target i686 pspacer-<version>.tar.gz
And then, install the RPM package.
# rpm -ivh /usr/src/redhat/RPMS/i686/pspacer-<version>.i686.rpm # rpm -q pspacer pspacer-<version>
[Experimental] For Debian GNU/Linux users, you can also install from a Debian package. To build the Debian package from the source tarball, issue as follows:
$ tar zxvf pspacer-<version>.tar.gz $ cd pspacer-<version> $ make deb
And then, install the Debian package.
# dpkg -i ../pspacer-<kernel version>_<version>_i386.deb # dpkg -l|grep psp ii pspacer-2.6.X- X.X-X PSPacer module for Linux (kernel ...
As an example, here we assume the following network configuration which connects three local area networks (LANs): the network A (192.168.1.0/24), the network B (192.168.2.0/24), and the network C (192.168.3.0/24).
192.168.1.1 192.168.1.2
+-------+ +-------+
| PC1 | | PC2 |
+-------+ +-------+ (network A)
| | 192.168.1.0/24
----------------------------------
|
\
\ RTT=200ms
|
link 1 +-----------------+ link 2
(500Mbps) | | (200Mbps)
| |
----------- -------------------
192.168.2.0/24 | | 192.168.3.0/24
(network B) | | (network C)
+-------+ +-------+
| PC3 | | PC4 |
+-------+ +-------+
192.168.2.1 192.168.3.1
On each LAN, PCs are connected by Gigabit Ethernet, and then the bottleneck link's bandwidth of the link 1 and the link 2 are 500 Mbps and 200 Mbps, respectively. Both RTTs are 200 ms. In addition, we change the txqueuelen value to 10000 as follows (see Interface queue size):
# /sbin/ifconfig eth0 txqueuelen 10000
To classify traffic toward each others' network, we will create a traffic control tree which has three leafs shown below:
1: root qdisc (psp)
/ | \
/ | \
/ | \
/ | \
1:1 1:2 1:3 leaf classes
[500Mbit][200Mbit][NO GAP]
| | |
| | |
10: 20: 30: sub qdiscs (pfifo)
Each class has its classid and each qdisc has its handle. A classid or a handle consists of two parts, a major number and a minor number: [major]:[minor]. It is customary to name the root qdisc "1:", which is equal to "1:0". The minor number of a qdisc is always 0.
In this example, traffics of the link 1 and 2 are controlled by qdiscs 10: and 20:, respectively, and other (unclassified) traffics are controlled by qdisc 30:. Here qdisc 10: and 20: correspond to the leaf class 1:1 and 1:2 respectively, and qdisc 30: corresponds to the leaf class 1:3, which is the default class. Default class is a class to which packets are routed if there is no filter specifications. And then each class except the default class has the target rate for pacing. In this example, the target rate of the link 1 and 2 should be regulated to 500 Mbps and 200 Mbps, respectively.
You can set/get parameters to the sch_psp qdisc via the tc (8) command. The general information of the tc command can be found in the man page and the Web page [lartc]. The man page of tc-psp is also included in this package.
Each interface has one root qdisc. To use PSPacer, the root qdisc should be the sch_psp. First, to add the sch_psp as the root qdisc, and set the default class as 1:3, issue the following command:
# /sbin/tc qdisc add dev eth0 root handle 1: psp default 3
Next, to add leaf classes to the root qdisc, issue the following commands:
# /sbin/tc class add dev eth0 parent 1: classid 1:1 psp rate 500mbit # /sbin/tc class add dev eth0 parent 1: classid 1:2 psp rate 200mbit # /sbin/tc class add dev eth0 parent 1: classid 1:3 psp mode 0
Next, we add sub qdiscs to the leaf classes. Here we use PFIFO qdisc. This is the default qdisc which has 3-band FIFO queue. Unless you want to do something special with another qdisc, use PFIFO.
To add PFIFO (sub) qdiscs within each leaf class, issue the following
# /sbin/tc qdisc add dev eth0 parent 1:1 handle 10: pfifo # /sbin/tc qdisc add dev eth0 parent 1:2 handle 20: pfifo # /sbin/tc qdisc add dev eth0 parent 1:3 handle 30: pfifo
Filters are used to classify traffic into the proper sub-qdiscs. Finally, to attach two filters to root qdisc, issue the following commands:
# /sbin/tc filter add dev eth0 parent 1: protocol ip pref 1 u32 match ip dst 192.168.2.0/24 classid 1:1 # /sbin/tc filter add dev eth0 parent 1: protocol ip pref 1 u32 match ip dst 192.168.3.0/24 classid 1:2
Here we use the u32 filter in order to classify traffic by destination IP addresses. Use u32 unless you want to do something special using another filter. By issuing these commands, packets whose destination are 192.168.2.0/24 are controlled by qdisc 1:1 while packets whose destination are 192.168.3.0/24 are controlled by qdisc 1:2.
Now, the configuration has been completed. To print the current parameter settings and statistics, issue the show option of the tc command. The -s option means statistics, and -d option means detail.
Let's examine what we have created. First, you can see the root “psp” qdisc and three “pfifo” sub-qdiscs:
# /sbin/tc -s -d qdisc show dev eth0 qdisc psp 1: default 3 direct pkts 0 max rate 1000Mbit gap 0 bytes 0 pkts Sent 0 bytes 0 pkts (dropped 0, overlimits 0 requeues 0) qdisc pfifo 10: parent 1:1 limit 10000p Sent 0 bytes 0 pkts (dropped 0, overlimits 0 requeues 0) qdisc pfifo 20: parent 1:2 limit 10000p Sent 0 bytes 0 pkts (dropped 0, overlimits 0 requeues 0) qdisc pfifo 30: parent 1:3 limit 10000p Sent 0 bytes 0 pkts (dropped 0, overlimits 0 requeues 0)
Here “max rate” shows the maximum transmission bandwidth of the network interface.
Next, you can see three “psp” classes:
# /sbin/tc -s -d class show dev eth0 class psp 1:1 root leaf 10: level 0 mode STATIC (500Mbit) Sent 0 bytes 0 pkts (dropped 0, overlimits 0 requeues 0) class psp 1:2 root leaf 20: level 0 mode STATIC (200Mbit) Sent 0 bytes 0 pkts (dropped 0, overlimits 0 requeues 0) class psp 1:3 root leaf 30: level 0 mode NOGAP Sent 0 bytes 0 pkts (dropped 0, overlimits 0 requeues 0)
Finally, you can see two “u32” filters:
# /sbin/tc filter show dev eth0 filter parent 1: protocol ip pref 1 u32 filter parent 1: protocol ip pref 1 u32 fh 800: ht divisor 1 filter parent 1: protocol ip pref 1 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:2 match 00110000/00ff0000 at 8 filter parent 1: protocol ip pref 1 u32 fh 800::801 order 2049 key ht 800 bkt 0 flowid 1:3 match 00060000/00ff0000 at 8
You can confirm the default class configuration by using ping. Here ping generates unclassified traffic.
# ping 192.168.1.2 PING 192.168.1.2 56(84) bytes of data. 64 bytes from 192.168.1.2: icmp_seq=0 ttl=64 time=0.231 ms 64 bytes from 192.168.1.2: icmp_seq=1 ttl=64 time=0.214 ms 64 bytes from 192.168.1.2: icmp_seq=2 ttl=64 time=0.244 ms 64 bytes from 192.168.1.2: icmp_seq=3 ttl=64 time=0.149 ms 64 bytes from 192.168.1.2: icmp_seq=4 ttl=64 time=0.050 ms
--- 192.168.1.2 ping statistics --- 5 packets transmitted, 5 received, 0% packet loss, time 4003ms rtt min/avg/max/mdev = 0.050/0.177/0.244/0.073 ms, pipe 2
# /sbin/tc -s -d qdisc show dev eth0
(snip)
qdisc pfifo 30: parent 1:3 limit 10000p
Sent 476 bytes 6 pkts (dropped 0, overlimits 0 requeues 0)
As you can see, all traffic to 192.168.1.2 is transmitted through qdisc 30:, which is the qdisc under the default (unclassified) class.
Next, we do some bulk data transfer by using iperf:
$ iperf -c 192.168.2.1 -i 10 -t 60 -w 32M ------------------------------------------------------------ Client connecting to 192.168.2.1, TCP port 5001 TCP window size: 64.0 MByte (WARNING: requested 32.0 MByte) ------------------------------------------------------------ [ 3] local 192.168.1.1 port 32771 connected with 192.168.2.1 port 5001 [ 3] 0.0-10.0 sec 463 MBytes 388 Mbits/sec [ 3] 10.0-20.0 sec 571 MBytes 479 Mbits/sec [ 3] 20.0-30.0 sec 572 MBytes 480 Mbits/sec [ 3] 30.0-40.0 sec 572 MBytes 480 Mbits/sec [ 3] 40.0-50.0 sec 553 MBytes 464 Mbits/sec [ 3] 50.0-60.0 sec 572 MBytes 480 Mbits/sec [ 3] 0.0-60.1 sec 3.23 GBytes 461 Mbits/sec
# /sbin/tc -s -d qdisc show dev eth0
qdisc psp 1: default 3 direct pkts 0 max rate 1000Mbit
gap 1718967000 bytes 1145978 pkts
Sent 1735006166 bytes 1145979 pkts (dropped 409155, overlimits 0 requeues 0)
qdisc pfifo 10: parent 1:1 limit 10000p
Sent 1735006166 bytes 1145979 pkts (dropped 409155, overlimits 0 requeues 0)
(snip)
As you see, all traffic is via qdisc 10:, and then the iperf throughput is about 480 Mbps. This rate is the half of the physical bandwidth, thus the amount of gap packets is about the same amount of real packets.
|
Note
|
PSPacer uses raw bandwidth (i.e. data link layer bandwidth) which includes IP, TCP or UDP headers as the target rate. On the other hand, iperf shows payload bandwidth. Therefore, iperf shows lower bandwidth than the target rate (see Why iperf shows bandwidth lower than the target rate which set by tc command). |
Some parameters can be modified in place. To change parameters, issue the following command:
# /sbin/tc class change dev eth0 parent 1: classid 1:2 psp mode 0 # /sbin/tc class change dev eth0 parent 1: classid 1:3 psp rate 250mbit
The following is an example of a class removal:
# /sbin/tc class del dev eth0 parent 1: classid 1:2
The following is an example of a qdisc removal:
# /sbin/tc qdisc del dev eth0 root psp
If you remove the root qdisc, all classes and qdiscs are also removed.
To install and uninstall the sch_psp module, issue the following commands, as super user:
# /sbin/modprobe sch_psp # /sbin/rmmod sch_psp
The current implementation does not support TSO. You must disable TSO by using the ethtool (8) command:
# /sbin/ethtool -K eth0 tso off
R.Takano, T.Kudoh, Y.Kodama, M.Matsuda, H.Tezuka, and Y.Ishikawa, “Design and Evaluation of Precise Software Pacing Mechanisms for Fast Long-Distance Networks,” In 3rd Intl. Workshop on Protocols for Fast Long-Distance Networks (PFLDnet05), February 2005.
IP routing utilities (iproute2), http://developer.osdl.org/dev/iproute2/.
Linux Advanced Routing & Traffic Control, http://lartc.org/howto/
The system should have capability to transmit packets at the maximum rate of the connected network link to realize precise pacing.
If your PC is cabled by Gigabit Ethernet, your network interface should be connected to the PC by PCI-Express, PCI-X, 66MHz/64bit PCI, or CSA. Otherwise, the I/O bus becomes a bottleneck, and maximum transmission rate can not be achieved. The processor itself may have enough performance if it is Pentium 4 or above. If your PC is cabled by Fast Ethernet, most of today's PCs may have enough performance.
This is an independent matter from PSPacer. Here are some tips.
Modify kernel parameters using sysctl(8) command. To change TCP settings, add the entries below to the file /etc/sysctl.conf, and then run "sysctl -p".
The optimal socket buffer size is twice the size of the bandwidth * delay product (BDP) of the link. However, the default max TCP buffer size is too small on long fat networks. For example, to fill a 125MB/s (i.e. 1Gbps) pipe with 100ms one-way latency requires 125 * 0.2 = 25MB of data. To change the socket buffer size, do the following:
# increase max TCP buffer size to 32MB (32 * 1024 * 1024) net.core.rmem_max = 33554432 net.core.wmem_max = 33554432
|
Note
|
Linux doubles the requested buffer size. So the above setting means that the max TCP buffer size is set to 64MB. |
Another thing you can try is to increase the size of the interface queue (i.e. the transmit queue of the network interface). To do this, issue the following using ifconfig (8) command:
# ifconfig eth0 txqueuelen 10000
netdev_max_backlog specifies the maximum length of the input queues for the processors. The default value is 300 (packets). Linux has to wait up to scheduling time to flush buffers (due to bottom half mechanism). Therefore, this value can limit the network bandwidth when receiving packets as follows:
300 * 1,000 = 300,000 packets HZ packets/s
300,000 * 1,000 = 300 M packets averate (bytes/packet) throughput (Bytes/s)
If you want to get higher throughput, you need to increase netdev_max_backlog:
net.core.netdev_max_backlog = 2500
When you run benchmarks and other experiments, it is often useful not to save the route metrics (cwnd, rtt, etc) from each connection. You can turn off caching the metrics:
net.ipv4.tcp_no_metrics_save = 1
The same effect can be achieved by executing this before each TCP session:
# echo 1 > /proc/sys/net/ipv4/route/flush
PSPacer uses raw bandwidth (i.e. data link layer bandwidth) which includes IP, TCP or UDP headers as the target rate. On the other hand, iperf shows payload bandwidth. Therefore, iperf shows lower bandwidth than the target rate.
In addition, especially when TCP/IP is used, the bandwidth may be regulated by congestion control or buffer size, etc. See "How can I achieve Gigabit speeds on Long Fat Networks using TCP/IP?".
Here we call a Linux PC on which a PSPacer installed a PSP box. The example below uses a PSP box to smooth traffic from LAN (1Gbps) to WAN (100Mbps, 20ms). Network Configuration
The network configuration is as follows:
+-------+
| PC1 | 192.168.1.2
+-------+
| 192.168.1.0/24
----------------------------------
|<-- 1Gbps (LAN)
|
| eth0 192.168.1.1
+-----------+
| |
| PSP Box |
| |
+-----------+
| eth0 192.168.2.1
|
|<-- 100Mbps, 20ms (WAN)
----------------------------------
| 192.168.2.0/24
+-------+
| PC2 | 192.168.2.2
+-------+
If PC1 and PC2 are Linux boxes, configurations are as below. Of course you can use other network connected PCs or equipments.
PC1: # ifconfig eth0 192.168.1.2 # route add -net 192.168.2.0 netmask 255.255.255.0 gw 192.168.1.1
PC2: # ifconfig eth0 192.168.2.2 # route add -net 192.168.1.0 netmask 255.255.255.0 gw 192.168.2.1
To enable IP forwarding, issue the following command:
# echo 1 > /proc/sys/net/ipv4/ip_forward
In the PSP box, issue commands as follows:
# tc qdisc add dev eth0 root handle 1: psp default 2 # tc class add dev eth0 parent 1: classid 1:1 psp rate 100mbit # tc class add dev eth0 parent 1: classid 1:2 psp mode 0 # tc qdisc add dev eth0 parent 1:1 handle 10: pfifo # tc qdisc add dev eth0 parent 1:2 handle 20: pfifo # tc filter add dev eth0 protocol ip parent 1: pref 1 u32 match ip dst 192.168.1.0/24 classid 1:1
Table.1 and Table.2 show tested environments which we have tested.
| Distribution | Kernel | iproute2 |
|---|---|---|
| FedoraCore 3 | 2.6.9 | ss040831 |
| FedoraCore 5 | 2.6.16 | ss060110 |
| Debian GNU/Linux sid | 2.6.12.2 | ss041019 |
| Name | Bus |
|---|---|
| Intel PRO/1000 | PCI-X, 64bit PCI, CSA |
| Intel PRO/100 | 32bit PCI |
| Broadcom BCM5704 | PCI-X |
Here we show some experimental results in a WAN emulation environment. In these experiments, a hardware network testbed, GtrcNET-1, emulates 200ms RTT latency. We measured the the bandwidth generated by iperf with a milli-second resolution.
+-----+
| TCP | (target rate 200 Mbps)
+-----+-----+-----+
| TCP | (target rate 250 Mbps)
+-----+-----------------+-----+
| |
| TCP | (target rate 500 Mbps)
| |
--+-----------------------------+--->
0 10 20 40 50 60 (sec)
+-----+
| UDP | (CBR 300 Mbps)
+-----+-----+-----+
| UDP | (CBR 300 Mbps)
+-----+-----------------+-----+
| |
| TCP | (target rate 500 Mbps)
| |
--+-----------------------------+--->
0 10 20 40 50 60 (sec)