English – noteblok

A highly available WireGuard VPN setup

WireGuard is a communication protocol and free and open-source software that implements encrypted virtual private networks (VPNs), and was designed with the goals of ease of use, high speed performance, and low attack surface.

My Journey to WireGuard

I’ve been using it in my home lab setup since about 2020. When it was finally merged into the Linux mainline (with release 5.6), I started to replace my former Tinc-VPN setup with it. Tinc-VPN is another great open source VPN solution. Unfortunately, its development has stalled over the last years, which motivated me to look for alternatives. In contrast to WireGuard, Tinc runs as a user-space daemon and uses tun/tap devices, which adds significant processing overhead. Like WireGuard, it uses UDP for tunneling data, but it falls back to TCP in situations where direct datagram connectivity is not feasible. A big advantage of Tinc, however, is its ability to form a mesh of nodes and to route traffic within it when direct P2P connections are not possible due to firewall restrictions. At the same time, this mesh is also used to facilitate direct connections by signaling the endpoint addresses of NATed hosts.

This mesh functionality made Tinc quite robust against the failure of single nodes, as traffic could usually be routed via other paths.

Highly Available WireGuard server setup

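WireGuard itself offers no such redundancy: if the server that the peers connect to fails, their tunnels fail with it. To counteract this shortcoming, this blog post presents a highly available WireGuard setup using the Virtual Router Redundancy Protocol (VRRP) as implemented by the keepalived daemon.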

That said, it is worth noting that this setup will not bring back some of the beloved features of Tinc. Meshing as well as the peer and endpoint discovery features of Tinc are not supported by WireGuard and likely never will be. Jason A. Donenfeld, the author of WireGuard, focused its design on simplicity, performance and auditability. Hence, advanced features like the ones mentioned are only available through additional agents or daemons which control and configure WireGuard for you. Examples of such tools are Tailscale, Netmaker and Netbird.

The setup presented in this post is a so-called active/standby configuration, consisting of two almost identically configured Linux servers which both run WireGuard and the keepalived daemon. As the name suggests, only one of those two servers actively handles the WireGuard tunnel traffic, while the other one stands by in case the active node fails or undergoes maintenance.

VRRP WireGuard Setup

Requirements

Before we get started, here are the requirements for the setup:

Two servers running Linux 5.9 or newer.
A working WireGuard configuration.
A local L2 network segment to which both servers are connected.
Upstream connectivity without NAT via a gateway connected to the network segment (usually provided by your internet or hosting provider).
An unused address to be used as the Virtual IP (VIP), which is roamed between the two servers by VRRP.

An important assumption here is that both servers run in the same switched network segment, as this is a requirement for VRRP. We also assume that the upstream gateway performs no NAT. This guide only covers IPv6 addressing; however, all steps can be adapted or repeated for a dual-stack or IPv4-only setup.

Detailed steps

Here are the specifics of my setup, which you will need to adapt:

Server key (the same key is used by both servers)
- Private: YIEDx+A2ONo5+uv3mrk/p7ileL3T5QQ8hleQK0IYEEI=
- Public: XGubrkGtuECdvoykKeUiNMigk2onfLCPfEo9Im+vmx8=
Peer key (in this example, we only have a single peer)
- Private: OIbpWVIVVBOtWfwkmXkFRN7Q/jBdfYtsGt7j97aHx1Q=
- Public: 3NGl6gTOGs6ai0RE91VmVFgF+N4gw1EBG11KOeiKJAg=
Public Server Subnet: 2001:DB8:1::/64
- Gateway: 2001:DB8:1::1
- Virtual IP: 2001:DB8:1::2
- Server A: 2001:DB8:1::3
- Server B: 2001:DB8:1::4
WireGuard Tunnel Subnet: 2001:DB8:2::/64
- Server: 2001:DB8:2::1 (the same address is used by both servers)
- Peer: 2001:DB8:2::2
Interface names
- WireGuard: wg1
- Upstream: eno1
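The keys above are example values only and should not be reused, since their private halves are published here. A minimal sketch for generating your own key pairs with the genkey and pubkey subcommands of wg from wireguard-tools could look like this; the function name gen_keypair and the <prefix>.key / <prefix>.pub file naming are my own assumptions. As both servers share the server key, generate it once and copy it to the second server.

```shell
#!/bin/sh
# Hypothetical helper: generate a WireGuard key pair as <prefix>.key / <prefix>.pub.
# The body runs in a subshell, so umask and "exit" do not affect the caller.
gen_keypair() (
    prefix=$1
    umask 077                               # private key must not be readable by others
    wg genkey > "${prefix}.key" || exit 1   # leaves only the subshell
    wg pubkey < "${prefix}.key" > "${prefix}.pub"
)

# Usage (run once, then copy server.key to the other server):
#   gen_keypair server
#   gen_keypair peer
```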

1. Prepare servers

We start by preparing the two servers and installing WireGuard and keepalived:

sudo apt install keepalived wireguard-tools iproute2

Next we configure a WireGuard interface on both servers using exactly the same configuration file at /etc/wireguard/wg1.conf:

[Interface]
Address = 2001:DB8:2::1/64
PrivateKey = YIEDx+A2ONo5+uv3mrk/p7ileL3T5QQ8hleQK0IYEEI=
ListenPort = 51800

[Peer]
PublicKey = 3NGl6gTOGs6ai0RE91VmVFgF+N4gw1EBG11KOeiKJAg=
AllowedIPs = 2001:DB8:2::2/128
PersistentKeepalive = 25

Similarly, a reciprocal configuration file is needed on the client side, which we skip here for brevity. Before proceeding, we activate the interface on both servers:

systemctl enable --now wg-quick@wg1

wg show wg1 # Check if interface is up
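A possible client-side configuration, reconstructed from the example values above, could look like the following sketch. It assumes that the client uses the VIP 2001:DB8:1::2 as its endpoint (never the addresses of Server A or B directly, so that a failover stays invisible to it) and the listen port 51800 from the server configuration. The function client_conf is a hypothetical helper that simply prints the file.

```shell
#!/bin/sh
# Sketch: print a client-side wg1.conf for the example values of this post.
# Assumption: the Endpoint is the VRRP VIP, not one of the server addresses.
client_conf() {
    cat <<'EOF'
[Interface]
Address = 2001:DB8:2::2/64
PrivateKey = OIbpWVIVVBOtWfwkmXkFRN7Q/jBdfYtsGt7j97aHx1Q=

[Peer]
PublicKey = XGubrkGtuECdvoykKeUiNMigk2onfLCPfEo9Im+vmx8=
Endpoint = [2001:DB8:1::2]:51800
AllowedIPs = 2001:DB8:2::/64
PersistentKeepalive = 25
EOF
}

# Usage on the client:
#   client_conf > /etc/wireguard/wg1.conf && systemctl enable --now wg-quick@wg1
```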

2. Configuring Keepalived

Create a configuration file for keepalived at /etc/keepalived/keepalived.conf:

global_defs {
    enable_script_security
    script_user root
}

# Check if the WireGuard interface is configured on this server
vrrp_script check_wg {
    script "/usr/bin/wg show wg1"
    user root
}

vrrp_instance wg_v6 {
    interface eno1
    virtual_router_id 52
    notify /usr/local/bin/keepalived-wg.sh

    state MASTER # use BACKUP for Server B
    priority 100 # use 99 for Server B

    virtual_ipaddress {
        2001:DB8:1::2/64
    }

    track_script {
        check_wg
    }
}

Create a notification script for keepalived at /usr/local/bin/keepalived-wg.sh:

#!/bin/bash

TYPE=$1
NAME=$2
STATE=$3
PRIO=$4

WGIF=wg1

case ${STATE} in
	MASTER)
		ip link set up dev ${WGIF}
		;;

	BACKUP|FAULT|STOP|DELETED)
		ip link set down dev ${WGIF}
		;;

	*)
		echo "unknown state"
		exit 1
esac

Finally, make the notification script executable and start the keepalived daemon:

chmod +x /usr/local/bin/keepalived-wg.sh
systemctl enable --now keepalived
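To check which server currently holds the VIP, a small helper can be used on either node. This is a sketch with hypothetical function names; note that ip prints IPv6 addresses in lowercase and compressed form, so the VIP has to be given in compressed form and is compared case-insensitively.

```shell
#!/bin/sh
# Hypothetical helpers: report whether this node currently holds the VIP.
# "ip -o" prints one address per line; IPv6 addresses appear in lowercase.
vip_in_output() {
    # The trailing "/" prevents 2001:db8:1::2 from matching 2001:db8:1::22
    grep -qiF "inet6 ${1}/"
}

has_vip() {
    ip -6 -o addr show dev "${2:-eno1}" | vip_in_output "$1"
}

# Usage on either server:
#   if has_vip 2001:DB8:1::2; then echo "holds the VIP"; else echo "no VIP"; fi
```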

3. Testing the failover

In our configuration, Server A has the higher VRRP priority and is therefore preferred when both servers are healthy. To test the setup, we simply bring down the WireGuard interface on Server A and observe how the VIP moves to Server B. From the WireGuard peer’s perspective, not much changes. In fact, no connections are dropped during the failover. Internally, the client’s WireGuard interface renegotiates the handshake, but this step is not noticeable to the user.

Run the following commands on Server A while simultaneously testing the connectivity from the client through the tunnel with ping -i0.2 2001:DB8:2::1:

# Check that keepalived has assigned the VIP to interface eno1
ip addr show dev eno1

# Bring down the WireGuard interface
wg-quick down wg1

# Keepalived should now have moved the VIP to Server B
ip addr show dev eno1
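To quantify the interruption during the failover, the client’s ping output can be saved (for example with ping -i0.2 2001:DB8:2::1 | tee ping.log) and the missing sequence numbers counted afterwards. The following is a rough sketch with a hypothetical function ping_gaps; it only counts gaps between received replies and ignores sequence number wrap-around.

```shell
#!/bin/sh
# Sketch: estimate the failover gap from saved ping output by counting the
# icmp_seq numbers that never got a reply. $1 is the ping interval in seconds.
ping_gaps() {
    awk -v interval="${1:-0.2}" '
        match($0, /icmp_seq=[0-9]+/) {
            # skip the 9 characters of "icmp_seq=" to get the number
            seq = substr($0, RSTART + 9, RLENGTH - 9) + 0
            if (seen && seq > last + 1) lost += seq - last - 1
            last = seq; seen = 1
        }
        END { printf "lost=%d outage~%.1fs\n", lost, lost * interval }
    '
}
```

For example, ping_gaps 0.2 < ping.log reports how many replies were lost and the resulting outage in seconds.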

Going further

In my personal network, I use dynamic routing to route traffic within the network and also towards other networks. Common interior gateway protocols (IGPs) are OSPF and IS-IS, although BGP is often used internally as well. In my specific case, both Servers A and B run the Bird2 routing daemon with internal and external BGP sessions.

So how does the WireGuard HA setup interoperate with my interior routing? Quite well, actually. As my notify script (keepalived-wg.sh) automatically brings the interface up or down, the routes attached to the interface are picked up by Bird’s direct protocol.

I am also planning to extend my WireGuard agent ɯice to support the synchronization of WireGuard interface configurations between multiple servers.

Background

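Surprisingly, the setup works with keepalived alone and does not require any iptables or nftables magic to rewrite source IP addresses. I’ve seen some people mention that SNAT/DNAT would be required to convince WireGuard to use the virtual IP instead of the server addresses. However, in my experience this was not necessary. A likely explanation is that the Linux WireGuard implementation remembers the local address on which it received a peer’s packets and uses it as the source address when replying to that peer.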

Another concern was that the WireGuard interface on the backup server might still attempt to establish handshakes with its peers. This would almost certainly interfere with the handshakes originated by the current master server. However, this did not turn out to be the case either. I assume that bringing down the WireGuard interface on the backup server via our notify script causes it to cease all communication with its peers.

Having a detailed look at the Netgear Nighthawk M5 Mobile LTE/Router

After gaining root access to the device in the first post of this series, we will have a closer look at the device and its firmware.

This post documents some internals of the device, which is not the most exciting read. I mainly collected it here for documentation purposes.

All information in this post has been collected from a device running firmware version NTGX55_12.04.12.00.

Software

Netgear’s firmware is Linux-based and uses quite a lot of common open-source tools. They provide all modifications to GPL licensed code via their support area: NETGEAR Open Source Software for Programmers.

From what I can tell, only the user interface and the configuration management were developed by Netgear themselves. Apart from that, there is a bunch of binary blobs provided by Qualcomm, which contain the modem firmware that gets loaded onto the baseband processor.

One curiosity caught my eye: there is an X server running on the device. It is used by the front-panel display. A custom application developed by Netgear uses the WebKit engine to render the touch screen interface, which, just like the web UI, is based on HTML and JavaScript.

Here is an almost complete list of open source software components which I found on the device:

atk (v2.28)
Avahi (v0.7)
bash (v4.4.23)
base-files (v3.0.14)
BusyBox (v1.29.3)
conntrack-tools (v1.0.1)
D-Bus (v1.12.10)
ddclient (v3.8.1)
dhcpcd (v5.2.10)
DiG (v9.11.5-P4)
Dnsmasq (v2.85)
ethtool (v4.19)
font-config (v2.12.6)
freetype (v2.9.1)
glib (v2.58.0)
hostapd (v2.8-devel)
iproute2 (iproute2-ss140804)
iptables (v1.6.2)
iw (v4.14)
libcap (v2.25)
libnfnetlink (v1.0.0)
Linux Kernel (v4.14.117)
miniupnpd
mtd-utils (v2.0.2)
nettle (v3.4)
OpenSSL (v1.1.1b)
pimd (v2.1.8)
pppd (v2.4.7)
strace (v4.24)
systemd (v239)
tinyproxy (v1.8.3)
util-linux (v2.32.1)
wireless-tools (v30)
wpa_supplicant (v2.9)
Xorg (v1.20.1)
xz (v5.2.4)

Basic facts

Let’s first have a look at the kernel version:

$ uname -a
Linux sdxprairie 4.14.117 #1 PREEMPT Thu Aug 19 23:42:26 UTC 2021 armv7l GNU/Linux

/ # cat /proc/version
Linux version 4.14.117 (oe-user@oe-host) (clang version 6.0.9 for Android NDK) #1 PREEMPT Thu Aug 19 23:42:26 UTC 2021

Apparently, the firmware has been built with OpenEmbedded, as indicated by the "oe-user" string in the kernel version.

There is also a /target file lying around. I assume that "sdxprairie" is Qualcomm’s name for the SDK/BSP which is used by Netgear.

$ cat /target
sdxprairie

The application processor of the Snapdragon X55 is a fairly low-powered single-core ARMv7:

$ cat /proc/cpuinfo
processor       : 0
model name      : ARMv7 Processor rev 5 (v7l)
BogoMIPS        : 38.40
Features        : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x0
CPU part        : 0xc07
CPU revision    : 5

Hardware        : Qualcomm Technologies, Inc SDXPRAIRIE
Revision        : 0000
Serial          : 0000000000000000

It comes with around 780 MB of RAM:

$ free -m
             total       used       free     shared    buffers     cached
Mem:           781        387        393          0          0        142
-/+ buffers/cache:        245        535
Swap:          109          0        109
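The CPU part value 0xc07 in the cpuinfo output above, together with implementer 0x41 (Arm Ltd.), identifies the application processor core as a Cortex-A7. The lookup can be sketched as follows; the helper arm_core_name is hypothetical and only covers a few common Arm cores.

```shell
#!/bin/sh
# Hypothetical helper: map the "CPU part" field of /proc/cpuinfo to an Arm
# core name. Only a handful of Arm Ltd. (implementer 0x41) parts are listed.
arm_core_name() {
    case "$1" in
        0xc05) echo "Cortex-A5" ;;
        0xc07) echo "Cortex-A7" ;;
        0xc08) echo "Cortex-A8" ;;
        0xc09) echo "Cortex-A9" ;;
        0xc0f) echo "Cortex-A15" ;;
        0xd03) echo "Cortex-A53" ;;
        0xd07) echo "Cortex-A57" ;;
        0xd08) echo "Cortex-A72" ;;
        *)     echo "unknown ($1)" ;;
    esac
}

# Usage on the device:
#   arm_core_name "$(awk -F': *' '/^CPU part/ {print $2; exit}' /proc/cpuinfo)"
```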

SoC details

Within sysfs, we can find some details about the SoC. More details about the meaning of the individual entries can be found in the kernel documentation:

SysFS Entry	Value
/sys/devices/soc0/accessory_chip	0
/sys/devices/soc0/chip_family	0x5e
/sys/devices/soc0/chip_name	SDX55
/sys/devices/soc0/family	Snapdragon
/sys/devices/soc0/foundry_id	1
/sys/devices/soc0/hw_platform	MTP
/sys/devices/soc0/image_crm_version	:ntgrbc-fwbuild6
/sys/devices/soc0/image_variant	MAATANAZA
/sys/devices/soc0/image_version	00:BOOT.SBL.4.1-00231
/sys/devices/soc0/machine	SDXPRAIRIE
/sys/devices/soc0/ncluster_array_offset	0xb0
/sys/devices/soc0/ndefective_parts_array_offset	0xb4
/sys/devices/soc0/nmodem_supported	0xffffffff
/sys/devices/soc0/nproduct_id	0x410
/sys/devices/soc0/num_clusters	0x1
/sys/devices/soc0/num_defective_parts	0xd
/sys/devices/soc0/platform_subtype	Invalid
/sys/devices/soc0/platform_subtype_id	5
/sys/devices/soc0/platform_version	65536
/sys/devices/soc0/pmic_die_revision	131072
/sys/devices/soc0/pmic_model	65568
/sys/devices/soc0/raw_device_family	0x6
/sys/devices/soc0/raw_device_number	0xb
/sys/devices/soc0/raw_id	207
/sys/devices/soc0/raw_version	2
/sys/devices/soc0/revision	2.0
/sys/devices/soc0/select_image	0
/sys/devices/soc0/serial_number	27453XXXX
/sys/devices/soc0/soc_id	357
/sys/devices/soc0/vendor	Qualcomm

$ cat /sys/devices/soc0/images
0:
        CRM:            00:BOOT.SBL.4.1-00231
        Variant:        MAATANAZA
        Version:        :ntgrbc-fwbuild6
1:
        CRM:            01:TZ.FU.5.9-00147
        Variant:        EATAANBAA
        Version:        :CRM
11:
        CRM:            11:MPSS.HI.2.0.c3.5-00010-SDX55_CPEALL_PACK-1.403198.3
        Variant:        sdx55.gennatch.prod
        Version:        :ntgrbc-fwbuild6
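A table like the one above can be generated with a small loop over the sysfs directory. The following sketch (dump_sysfs is a hypothetical helper) prints one tab-separated line per file and skips directories, unreadable files and multi-line entries such as images.

```shell
#!/bin/sh
# Sketch: print "path<TAB>value" for each single-line file in a sysfs
# directory (default: /sys/devices/soc0).
dump_sysfs() {
    nl='
'
    for f in "${1:-/sys/devices/soc0}"/*; do
        [ -f "$f" ] || continue                     # skip directories and symlinks to them
        value=$(cat "$f" 2>/dev/null) || continue   # skip unreadable entries
        case "$value" in
            *"$nl"*) continue ;;                    # skip multi-line values
        esac
        printf '%s\t%s\n' "$f" "$value"
    done
}
```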

Kernel command line

$ cat /proc/cmdline
noinitrd rw rootwait console=ttyMSM0,115200,n8 androidboot.hardware=qcom msm_rtb.filter=0x237 androidboot.console=ttyMSM0 lpm_levels.sleep_disabled=1 firmware_class.path=/lib/firmware/updates service_locator.enable=1 net.ifnames=0 atlantic_fwd.rx_ring_size=1024 pci=pcie_bus_perf rootfstype=ubifs rootflags=bulk_read root=ubi0:rootfs ubi.mtd=24 androidboot.serialno=105d0dc7 androidboot.baseband=msm
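Individual parameters can be extracted from this line with a small helper. This is a sketch with a hypothetical function name; it splits on whitespace, so quoted values containing spaces and flag-only parameters like rootwait are not handled.

```shell
#!/bin/sh
# Hypothetical helper: print the value of kernel parameter $1 from the file $2
# (default: /proc/cmdline). Returns 1 if the parameter is not present.
cmdline_get() {
    tr -s ' \t' '\n\n' < "${2:-/proc/cmdline}" | awk -v key="$1" '
        index($0, key "=") == 1 { print substr($0, length(key) + 2); found = 1; exit }
        END { exit !found }
    '
}

# Usage on the device:
#   cmdline_get root        # prints ubi0:rootfs
```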

Kernel log

Unfortunately, I was not able to capture the early kernel log messages. I assume those are only printed to a serial port and are lost because the circular buffer for the kernel log has not yet been set up at that point.

Full log: dmesg.txt

More details…

Network interfaces: interfaces.txt (ip address show)
Devices: devices.txt (ls -l /dev)
Kernel modules: modules.txt (lsmod)
Running processes: processes.txt (ps aux)
Mounted filesystems: mounts.txt (mount)
Flash partitions: mtd.txt (cat /proc/mtd)
Open ports: ports.txt (netstat -plnt)
Ubi devices & volumes: ubinfo.txt (ubinfo -a)
Kernel config: config.txt (zcat /proc/config.gz)
Devicetree source: device-tree.dts (dtc -I fs -O dts /proc/device-tree)
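The outputs listed above can be collected in one go with a loop like the following. It is a sketch (collect_details is a hypothetical name, and /tmp/details an arbitrary default output directory) that runs each of the commands listed above and keeps going if a tool such as dtc or ubinfo is missing on the device.

```shell
#!/bin/sh
# Sketch: run each "file|command" pair and store the output in <outdir>/<file>.
# Failing or missing commands leave a note in the file instead of aborting.
collect_details() {
    out=${1:-/tmp/details}
    mkdir -p "$out" || return 1
    while IFS='|' read -r file cmd; do
        [ -n "$file" ] || continue
        sh -c "$cmd" < /dev/null > "$out/$file" 2>&1 \
            || echo "# command failed: $cmd" >> "$out/$file"
    done <<'EOF'
interfaces.txt|ip address show
devices.txt|ls -l /dev
modules.txt|lsmod
processes.txt|ps aux
mounts.txt|mount
mtd.txt|cat /proc/mtd
ports.txt|netstat -plnt
ubinfo.txt|ubinfo -a
config.txt|zcat /proc/config.gz
device-tree.dts|dtc -I fs -O dts /proc/device-tree
EOF
}
```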

Feel free to contact me if I missed any particular detail which is interesting for you.

Thesis: Extended Abstract

Almost fourteen months ago, I started working on my bachelor thesis. Although I finished it half a year ago, it’s still part of my work as a student research assistant.

During my initial work, most of the code was written for an internal research kernel. I’m now happy that we were able to port it to an open source kernel called eduOS. This minimal operating system is used for practical demos and assignments in the OS course at my university. There is much more I could write about this, so it will probably become a separate blog post.

The motivation for this article is an abstract I wrote for the student research competition of the ASPLOS conference, which is held this year in Istanbul, Turkey. Unfortunately, my submission got rejected. But as a nice side effect, I now have the chance to present my work to an English audience as well:

PDF Version

Self-referencing Page Tables for the x86-Architecture

A simple Paging Implementation for a minimalistic Operating System

Steffen Vogel

Academic advisor: Dr. rer. nat. Stefan Lankes
Institute for Automation of Complex Power Systems
E.ON Energy Research Center, RWTH Aachen University
Mathieustr. 10, 52074 Aachen, Germany

This was a submission for ASPLOS Student Research Competition ’15 Istanbul, Turkey¹

ABSTRACT

The adoption of 64 bit architectures went along with an extension of the virtual address space (VAS). To cope with this growth, the memory management unit (MMU) had to be extended as well. For paging-based systems like Intel’s x86-architecture this was realized by adding more levels of indirection to the page table walk.

This walk translates virtual pages to physical page frames (PF) by performing look-ups in a radix / prefix tree in which every node represents a page table (Figure 1a). Since the tables are part of the translation process, they must be referenced by physical page frame numbers (PFN, blue line). As the operating system can only access memory through the VAS, it cannot directly follow the path of a walk. In order to allow the manipulation of page tables, it must provide:

Access to the table entries, by mapping the tables themselves into the VAS.
A mapping from physical references to the corresponding locations in the VAS.

Figure 1a: Page table walk in the x86-64 long mode without self-reference.

Additionally, every level of the page table walk increases the complexity of managing these mappings. The tables also increase memory consumption by occupying physical page frames. Both drawbacks can be avoided with the technique described in the following.

In my bachelor thesis, I presented an approach which is compatible with both the 32 bit and the 64 bit version of Intel’s x86-architecture. This allows replacing two code bases, one for each architecture, with a single one supporting both, which results in shorter, more comprehensible and more maintainable code. Our teaching OS “eduOS” was used as the foundation for this implementation². “eduOS” supports only the 32 bit protected mode, whereas the 64 bit long mode had only been implemented for an internal research kernel.

Thanks to the sophisticated design of Intel’s x86 MMU, it is possible to avoid most of the complexity and space requirements with a little trick. Adding a self-reference to the root table (PML4 or PGD, respectively) automatically enables access to all page tables from the VAS, without the need for the manual mappings described above (Figure 1b). The operating system does not need to follow the path of a page table walk manually, as this task is performed by the MMU, which then resolves addresses to individual tables instead of page frames.

Figure 1b: Page table walk with self-reference.
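The self-reference itself is just an ordinary root table entry that points to the root table’s own physical page frame. A minimal sketch of how such an entry value could be computed for x86-64 is shown below; it is not taken from eduOS, the helper self_ref_entry is hypothetical, and it sets only the present and writable flags.

```shell
#!/bin/sh
# Sketch: compute the value stored in the last PML4 entry (index 511) to create
# the self-reference. $1 is the physical address of the PML4 table itself.
# Flags: bit 0 = present, bit 1 = writable; bit 2 (user) stays clear so the
# page tables are only reachable from kernel mode.
self_ref_entry() {
    phys=$(( $1 ))
    if [ $(( phys & 0xfff )) -ne 0 ]; then
        echo "PML4 address is not 4 KiB aligned" >&2
        return 1
    fi
    printf '%016x\n' $(( phys | 0x3 ))
}
```

For example, a PML4 at physical address 0x1a2000 yields the entry 00000000001a2003. On 32 bit, the same idea applies to the last PGD entry (index 1023).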

An access to the VAS region covered by a self-reference causes the MMU to look up the root table twice (red line). Effectively, this shifts the whole page table walk by one level. Therefore, it ends with the PFN of a page table instead of the page frame that the MMU usually translates to. Here, both the PML4 and PDPT indexes are used to select an entry from the PML4 table. Therefore, it must be guaranteed that PML4 entries can be interpreted as PDPT entries, too. This imposes the following requirements:

Homogeneous coding of paging flags across all paging levels.
Equal table sizes across all paging levels.

Fortunately, the x86-architecture complies with these prerequisites, as shown in Figure 2. Green colored flags are coded consistently across all paging levels. Only the PAT, size and global flags have a slightly different meaning for entries in the PGT. My bachelor thesis shows that these deviations still allow maintaining full control over the caching and memory protection properties of self-mapped tables. This includes support for common system calls like fork() and kill().

Figure 2: Similar flags across all paging levels.
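The shift by one level can be illustrated by splitting virtual addresses into their table indices. In the sketch below (split_va is a hypothetical helper), the address 0x40201000 has the indices 0, 1, 1, 1. The PGT entry that maps it is reachable at the self-mapped address with indices 511, 0, 1, 1 and offset 8: every index has moved one level down, and the former PGT index 1 became the byte offset 1 × 8. Since shell arithmetic is signed 64 bit, upper-half addresses are passed as arithmetic expressions rather than large hex literals.

```shell
#!/bin/sh
# Sketch: split a 48-bit x86-64 virtual address into PML4, PDPT, PGD and PGT
# indices (9 bits each) and the 12-bit page offset.
split_va() {
    va=$(( ($1) & 0xffffffffffff ))            # keep the 48 translated bits
    printf '%d %d %d %d %d\n' \
        $(( (va >> 39) & 511 )) $(( (va >> 30) & 511 )) \
        $(( (va >> 21) & 511 )) $(( (va >> 12) & 511 )) \
        $(( va & 0xfff ))
}
```

Here, split_va 0x40201000 prints 0 1 1 1 0, and split_va '-(1 << 39) + 0x201008' prints 511 0 1 1 8.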

By repeatedly addressing the self-reference, it is also possible to access the tables of the upper levels (PGD to PML4). Table 1 shows the resulting virtual addresses of all page tables when the last (512th) entry of the PML4 table is used for the self-reference³. This grants access to all possible page tables, including those which might not exist yet and may be allocated in the future. Hence, the self-reference reserves a fixed fraction of the VAS for the page tables. The size of this region is 256 TiB / 512 = 512 GiB for 64 bit (or 4 GiB / 1024 = 4 MiB for 32 bit), which is negligible compared to the huge VAS of 2⁴⁸ bytes.

Table 1: Virtual addresses of self-mapped tables.
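The start addresses of the self-mapped table regions can be recomputed with a few lines of shell arithmetic. This sketch assumes the self-reference in the last root entry, as described above. Each further pass through the self-reference fixes another 9 (or, on 32 bit, 10) address bits to ones, so the region bases are simply -(1 << shift) in two’s complement. The function names are hypothetical.

```shell
#!/bin/sh
# Sketch: start addresses of the self-mapped regions, assuming the
# self-reference in PML4[511] (x86-64) or PGD[1023] (32-bit x86).
self_map_bases64() {
    for level in "PGT 39" "PGD 30" "PDPT 21" "PML4 12"; do
        set -- $level
        # printf %x prints the negative value as a 64-bit two's complement
        printf '%-4s %016x\n' "$1" $(( -(1 << $2) ))
    done
}

self_map_bases32() {
    printf '%-4s %08x\n' PGT $(( (1 << 32) - (1 << 22) ))   # 4 MiB below 4 GiB
    printf '%-4s %08x\n' PGD $(( (1 << 32) - (1 << 12) ))   # the last 4 KiB page
}
```

self_map_bases64 prints the bases ffffff8000000000 (PGT), ffffffffc0000000 (PGD), ffffffffffe00000 (PDPT) and fffffffffffff000 (PML4); self_map_bases32 prints ffc00000 and fffff000.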

For the manipulation of page table entries, two approaches are feasible:

Top-down: Use known tree traversals, starting at the root node, which corresponds to the PML4 or PGD, respectively.
Bottom-up: Use the page fault handler to create new tables on the fly when they are not yet present.
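For the top-down approach, a traversal starts at the PML4 and descends to the entries belonging to a given virtual address. With the self-reference, the addresses of these entries follow directly from the virtual address, as in this sketch (entry_addrs is a hypothetical helper, assuming PML4[511] as the self-reference). In a real kernel, each entry’s present flag has to be checked before the next address is accessed, because touching an entry of a missing table causes a page fault, which is what the bottom-up approach relies on.

```shell
#!/bin/sh
# Sketch (top-down): print the self-mapped addresses of the PML4, PDPT, PGD and
# PGT entries that translate the given virtual address, in traversal order.
entry_addrs() {
    va=$(( ($1) & 0xffffffffffff ))            # keep the 48 translated bits
    printf 'PML4 %016x\n' $(( -(1 << 12) + (va >> 39) * 8 ))
    printf 'PDPT %016x\n' $(( -(1 << 21) + (va >> 30) * 8 ))
    printf 'PGD  %016x\n' $(( -(1 << 30) + (va >> 21) * 8 ))
    printf 'PGT  %016x\n' $(( -(1 << 39) + (va >> 12) * 8 ))
}
```

For example, entry_addrs 0x40201000 prints the entry addresses fffffffffffff000, ffffffffffe00008, ffffffffc0001008 and ffffff8000201008.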

But there are also other architectures which satisfy the prerequisites described above. One of them is the Alpha⁴ architecture, whose reference manual suggests a similar approach. Intel and AMD do not mention the technique in their x86 manuals. Among operating systems, adoption is even more limited: there is only a single reference⁵, dated 2010, indicating that Microsoft might use a similar approach for its NT kernel. Linux cannot profit from it because its paging implementation must support a broad selection of virtual memory architectures, not all of which fulfill the requirements mentioned above.