Tailscale is an overlay mesh network. It relies on a central login server which then brokers WireGuard connections behind the scenes. Tailscale also offers DNS, ACLs and a whole bunch of other bits around the side of just transferring data between nodes.
Tailscale is partially closed source and is developed by a business with paid offerings for enterprise. A bunch of their stuff is open source though, and from their blog https://tailscale.com/blog/ they seem a pretty cool company with a heavy focus on fancy new tech solutions, and with a clear route for funding.
The most notable closed-source part of Tailscale is the login server, which Tailscale hosts. Headscale is an open-source implementation of that login/coordination server which you can self-host!
Tailscale even acknowledge this on their blog and seem to have a nice relationship with the headscale dev. The main headscale dev is also super polite and considerate towards Tailscale; it's quite a sweet & wholesome relationship to read about on GitHub.
How do I use it?
Currently running Headscale internally on my main server, deployed via docker compose (see Docker containers on overlays)
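For context, the deployment is roughly a compose service like this - a minimal sketch rather than my exact file, with the image tag, paths and ports being assumptions:
services:
  headscale:
    image: headscale/headscale:latest
    container_name: headscale
    restart: unless-stopped
    command: serve
    volumes:
      - ./config:/etc/headscale      # config.yml lives here
      - ./data:/var/lib/headscale    # state (db, keys)
    ports:
      - "8080:8080"                  # headscale endpoint that clients register against
      - "3478:3478/udp"              # STUN, if using the embedded DERP server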
DNS
I keep hitting DNS issues, so here are some relevant notes.
The Tailscale client sets the system DNS to 100.100.100.100, which points at a small resolver built into the tailscale client itself.
That inbuilt resolver forwards queries to the nameservers specified on the tailscale/headscale login server (config.yml), which can include IP addresses within the tailscale ranges.
However, there are many moving parts here, with many layers of fallbacks and failovers, which complicates things and leads to general flakiness.
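For reference, the relevant bit of headscale's config looks roughly like this (this is the older dns_config layout - newer headscale versions nest it under dns: - and the nameserver IPs and base_domain here are placeholders, not my actual values):
dns_config:
  override_local_dns: true    # push 100.100.100.100 to clients as their DNS server
  magic_dns: true
  base_domain: example.com    # placeholder
  nameservers:
    - 100.64.0.53             # placeholder tailnet IP, e.g. an internal adguard instance
    - 1.1.1.1                 # public fallback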
2024-03-20 listen port ignored
Noticed that newer containers I'd built by installing tailscale via the Dockerfile were appearing as relayed on my LAN. I realised that the tailscaled daemon wasn't respecting the --port parameter when a login server was specified.
Turns out the issue was that I had randomize_client_ports enabled in the headscale config. There's still some inconsistency here: I think most people would read the docs and assume that --port should take priority: https://github.com/tailscale/tailscale/issues/11174
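For reference, the headscale side of this is a single key in config.yml (name as per the headscale example config; whether it should override --port is exactly the inconsistency in the issue above):
# when true, clients ignore their configured WireGuard port and pick a random one,
# which is what was overriding tailscaled's --port
randomize_client_ports: false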
DERP
DERP is the relay protocol Tailscale uses to help nodes establish connections (and to carry traffic when NAT traversal fails and a direct connection can't be made). There are a bunch of publicly available, free-to-use Tailscale DERP servers, but, y'know, selfhosting...
Headscale has a built-in DERP server.
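Enabling it is a toggle in config.yml - a sketch of the relevant section, with the region and STUN values here being the usual defaults rather than anything specific to my setup:
derp:
  server:
    enabled: true
    region_id: 999
    region_code: "headscale"
    region_name: "Headscale Embedded DERP"
    stun_listen_addr: "0.0.0.0:3478"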
For the past ~12 months (as of 2024-05-19) I've been using the headscale built-in DERP server, reverse proxying it via Oracle for devices outside my LAN, using some DNS hackery so internal devices go straight to the server locally while external devices go via the cloud proxy.
This has generally worked, but there have been a few times where traffic was relayed when it didn't need to be. I think this is because the DERP server tries to gather a collection of IP addresses on which it can reach each node, and then gives those addresses out to other nodes so they can form direct connections. Because the headscale server only ever sees the LAN IPs of the local devices, external devices never get my external IP to reach the internal nodes.
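The DNS hackery is basically split-horizon: the same DERP hostname resolves differently depending on where you ask from. A sketch in zone-record form, with the hostname and IPs being placeholders:
; internal view (LAN resolver): clients get the local headscale host directly
derp.example.com.   300  IN  A  192.168.1.10   ; placeholder LAN IP
; public view (internet DNS): everyone else gets the Oracle VPS, which reverse proxies back
derp.example.com.   300  IN  A  203.0.113.10   ; placeholder Oracle public IP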
This is all a compromise to facilitate self-hosting headscale locally. The easiest/recommended approach would be to just run headscale directly in Oracle; that way all nodes - remote and local - have the same view of, and interaction with, the single headscale DERP server. The downside is that the core part of the system is hosted externally, which has 2 main disadvantages:
- if my internet dies, eventually all my local stuff will fall apart because it can't reach the coordination server
- if any local nodes can't get a direct connection (not uncommon when I'm doing many layers of virtualisation and networking), their traffic will be relayed via the internet
Both of these are low probability but massive impact for me: I can't stand extreme inefficiencies like a complete loss of service, or turning what should be at most a few metres' trip around my home network into a cross-country round trip.
Just use 2 DERP servers
It was super easy to just use the built-in DERP server plus a few select DNS records to create the setup above. But really, I should just use 2 DERP servers: one in Oracle and one locally. This solves all the problems: when the internet is working, all nodes can use both DERP servers, and the external one gets the same view of remote and local nodes. If the internet dies, or 2 local nodes can't communicate directly, the local DERP server should keep everything running.
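In headscale terms that means pointing derp.paths at a custom DERP map with two regions - a sketch with placeholder region IDs, codes and hostnames:
regions:
  900:
    regionid: 900
    regioncode: local
    regionname: LAN DERP
    nodes:
      - name: 900a
        regionid: 900
        hostname: derp-local.example.com   # placeholder, resolves to the LAN host
        stunport: 3478
        stunonly: false
        derpport: 0                        # 0 = default 443
  901:
    regionid: 901
    regioncode: oracle
    regionname: Oracle DERP
    nodes:
      - name: 901a
        regionid: 901
        hostname: derp-oracle.example.com  # placeholder, resolves to the VPS
        stunport: 3478
        stunonly: false
        derpport: 0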
Here are the docs for custom DERP servers: https://tailscale.com/kb/1118/custom-derp-servers - the derper itself is a Go package.
Initially I'm going to manually install and configure it on the core LXC to see how it works:
- install Go (had to do it manually by downloading the tar.gz from the Go site, as the Rocky default repos had an old version)
- install and run derper (install sketch below), e.g.
derper --hostname=i.derp.1dom.fun
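The install step was roughly this - assuming the standard tailscale.com module paths (the custom DERP docs pin @main; @latest should also work) and that Go is already on PATH after the manual install:
go install tailscale.com/cmd/derper@latest
go install tailscale.com/cmd/derpprobe@latest   # used for debugging below
# binaries land in $(go env GOPATH)/bin, e.g. /root/go/bin/derper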
Hitting issues: after setting the DERP map location and running headscale, headscale reloads the DERP map every 10 mins but keeps giving warnings about no DERPs being loaded from the map....
A few debugging notes:
- go install tailscale.....cmd/derpprobe
- derpprobe seems to prefer JSON files, so I converted the DERP config from YAML to JSON
- run it e.g.
derpprobe --derp-map=file:///opt/docker-compose-headscale/data/etc/i.derp.test.json
- initial errors were TLS-related, because it's an internal server: derper tries to automagically get a certificate from Let's Encrypt, but Let's Encrypt can't reach it to verify
- the quick solution for me was to manually generate TLS certs and, as per https://github.com/tailscale/tailscale/issues/2794, use
-certdir ..... -certmode manual
with the derper command
- once that was resolved, I kept seeing IPv6 failures; since I already provide the hostname and IPv4, removing the IPv6 address seemed to fix things
- derpprobe doesn't seem to output anything when things work, but errors for the things that don't
- after restarting headscale, running
tailscale netcheck
on another node on the tailscale network showed my custom DERP region alongside the built-in headscale one
- the derper listen port can be set manually, e.g. to 4433:
/root/go/bin/derper --hostname=i.derp.1dom.fun -certdir /opt/docker-compose-headscale/data/etc/certs -certmode manual --a=:4433
- must then also update the DERP map node entry with (see the sketch below):
derpport: 4433
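A sketch of what that node entry ends up looking like in the custom region - only the hostname and derpport come from my setup, the rest are placeholder values:
nodes:
  - name: 900a              # placeholder
    regionid: 900           # placeholder
    hostname: i.derp.1dom.fun
    stunport: 3478
    stunonly: false
    derpport: 4433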
ACLs
Need to set up ACLs on there. I had some conversations with different local LLMs, and ended up using GPT-4o for the actual policy generation after the conversation. Going through that made me investigate setting up multiple users: I created some test users and moved nodes around with headscale nodes move -i ... -u ....
Without ACLs, all users can still speak to each other fine.
Retrofitting ACLs isn't fun. The more stuff already running, the more likely turning on ACLs will break something.
A quick list of initial things I want the ACLs to cover:
all nodes -> otelcollector, adguardDNS
end-users -> kasm, uptimekuma, openwebui, gitea
otelcollector, uptimekuma, gitea(runner?), kasm -> all nodes (initially?)
The issue I'm seeing now is that I can use tags, users, groups and hosts to restrict stuff. The nicest option from my position seems to be tags. The issue with that is that tags are controlled by users, so as long as all nodes are owned by the same user, that user can manage every tag. The same goes if I use users or groups: with all nodes belonging to the same user, and that user being a power user, all bets are off.
So my current thinking is I need to bring in users; I propose at least these 3:
dom - admin, manages tags, acls, owns my end user devices too. can access everywhere
svc - user for owning all the service nodes (e.g. everything not an end user device)
dad - standard regular user, just owns his own end-user nodes
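Creating those users and moving nodes between them is just a couple of headscale CLI calls - a sketch using the same container name as elsewhere in these notes, with the node id being an example:
docker exec -it headscale headscale users create dom
docker exec -it headscale headscale users create svc
docker exec -it headscale headscale users create dad
# then move service nodes across to svc, e.g.
docker exec -it headscale headscale nodes move -i 7 -u svc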
Then tag-wise, we probably want to think about high-level groups, e.g.
end-user-devices
admin-devices (can access everything, so allows me to access openobserve interface, adguard interface, anything that regular users shouldn't have)
core-services (manage DNS, uptime, logging, and remote access, can reach out to stuff.)
internal-services (manage user access to gitea interface, uptime kuma interface, kasm interface, openwebui interface)
public (managing publicly accessible things)
Probably want to have a core-services-consumer group which I guess most nodes would have.
Then e.g. a dns-service-provider tag and a log-service-provider tag.
Then a rule, e.g. allow core-services-consumer -> dns-service-provider on port 53 (sketched below).
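A sketch of that rule as it would appear in the ACL policy's acls array - the tag names are the hypothetical ones from above (shown as tags, though a group would also work for the consumer side) and would need matching tagOwners entries plus nodes actually tagged with them:
{
  "action": "accept",
  "src": ["tag:core-services-consumer"],
  "dst": ["tag:dns-service-provider:53"]
}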
So I can see a few iterations here to get somewhere good.
step 1: 5 tags and generous portless rules for the tags
this should give me enough to create some high level boundaries between me, Dad, and public stuff.
step 2: move nodes to appropriate users
2024-06-22
Added an ACLs JSON (via openwebui), but since all nodes are added as the dom user, dom is in the admin group, and the admin group can access all nodes thanks to the middle rule, these ACLs effectively do nothing:
acls json
{
  "groups": {
    "group:admin": ["dom"]
  },
  "tagOwners": {
    // admin (dom) can tag the nodes
    "tag:end-user-devices": ["group:admin"],
    "tag:core-infra": ["group:admin"],
    "tag:internal-services": ["group:admin"],
    "tag:public-infra": ["group:admin"],
    "tag:dns": ["group:admin"],
    "tag:syslog": ["group:admin"]
  },
  "Hosts": {},
  "acls": [
    // Dom's main end-user devices access
    {
      "action": "accept",
      "src": ["tag:end-user-devices"],
      "dst": [
        "tag:syslog:*",
        "tag:dns:*"
      ]
    },
    // Admin can access all nodes
    {
      "action": "accept",
      "src": ["group:admin"],
      "dst": ["*:*"]
    },
    // Core infrastructure can reach out to other nodes
    {
      "action": "accept",
      "src": ["tag:core-infra"],
      "dst": ["*:*"]
    },
    // All nodes can access DNS and syslog
    {
      "action": "accept",
      "src": ["*"],
      "dst": [
        "tag:dns:*",
        "tag:syslog:*"
      ]
    },
    // Internal services accessible by end-user devices for web interfaces
    {
      "action": "accept",
      "src": ["tag:end-user-devices"],
      "dst": ["tag:internal-services:*"]
    }
  ]
}
Commenting out the admin rule without applying any tags to anything causes basically everything to break: pretty much all the rules depend on tags, and everything is blocked by default, so no tags means no matching rules means no traffic.
Adding tags to nodes:
# desktop
docker exec -it headscale headscale nodes tag -i 1 -t tag:admin-devices
# adguard
docker exec -it headscale headscale nodes tag -i 5 -t tag:dns,tag:internal-services
# domplus 7 pro
docker exec -it headscale headscale nodes tag -i 6 -t tag:admin-devices
# gitea
docker exec -it headscale headscale nodes tag -i 7 -t tag:core-services,tag:internal-services
# uptimekuma
docker exec -it headscale headscale nodes tag -i 8 -t tag:dns,tag:core-services,tag:internal-services
# uptimekuma
docker exec -it headscale headscale nodes tag -i 9 -t tag:dns,tag:core-services,tag:internal-services
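To sanity check that the tags took, listing the nodes should show them (same container name assumption as above):
docker exec -it headscale headscale nodes list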