“All the News That’s Omg Lol”

Online Edition
As if we’d actually use ink for this?

Vol. 18 of 36
March 2, 2023 3:20 AM UTC
Free

Infrastructure Month: A Retrospective

Hello, it is me, your somewhat disheveled captain of this elaborate internet vessel that we call omg.lol. I’ve spent the past few weeks giving our platform a sorely needed upgrade, building and rebuilding servers, provisioning load balancers, testing things, breaking things, fixing things, and repeating the process more times than I would ever care to admit. It’s been super interesting! And, at times, infuriating! And, ultimately, a bit embarrassing. But more on that in a moment.

Take 1: There was an attempt

In my last update, which I thought would be my final update on the topic (hah, how naïve), I talked about the factors that led to the need to update and scale our infrastructure. I didn’t go into much detail on what exactly was done, though, so I’ll share that now: we went from a single server that did everything to a nice new setup that included a new load balancer, four edge nodes, and a dedicated API server. It took just under two weeks to design, build, test, and refine.

When it was finally finished, it was amazing! I was so proud of what I had done, and so grateful to the people in our community who offered advice and expertise. I was watching traffic flow through the load balancer, reaching the edge nodes, and everything was holding up wonderfully. The recent attacks were being handled with relative grace and things were so much more stable. Mission accomplished!

Well, no, not quite.

A few days in, the attacks began to pick up in intensity, reaching a point where they were starting to disrupt things again. “No problem,” I said. “I’ll just roll out this fancy new intrusion detection and mitigation system that I made, which drops offending traffic at the firewall level, and we’ll live happily ever after!” So I turned that on and turned my frown upside down.

But, uh, it didn’t work. At all. The reason, as it happens, was a specific chain of decisions I made during the design and build phase of this project:

  1. For the load balancer, instead of creating my own, why not use Hetzner’s purpose-built, rapidly deployable offerings? Yeah, that’s great, I’ll run with an LB11, which can handle 10,000 simultaneous requests.

  2. Hetzner’s load balancers offer TLS termination or passthrough modes; TLS termination would be nicer, but their most advanced setup is limited to 50 certs. We have hundreds of members who use external domains pointed at omg.lol, all of which need certs, so I’ll use a passthrough setup.

  3. The edge nodes need to be able to identify traffic sources, but fortunately Hetzner’s solution supports the proxy protocol, which means I can configure Apache to read the original client address and properly distinguish inbound traffic.

All of this was fine, right up until the point where I realized—in a painful, embarrassing way—that I couldn’t properly block bad traffic because of the proxy protocol setup in this particular load balancer. Sure, I had Apache tuned to work with it perfectly, but nothing else could. The most I could see or block was the load balancer itself, which clearly wouldn’t do anyone any good. And as you can see from the decision points above, any attempt to backtrack on any of these decisions would result in not being able to manage the certificates for external domains.
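For anyone curious why the firewall was blind here: with the proxy protocol, the real client address travels inside the connection as a small header, while the TCP connection itself comes from the load balancer. Only software that reads that header (Apache, in this case) ever learns who the client is. Here is a minimal Python sketch of that idea. It assumes the human-readable v1 form of the protocol, and the function name and addresses are made up for illustration (the addresses come from the reserved documentation ranges); it is not the code omg.lol ran.

```python
from typing import NamedTuple, Optional


class ProxyInfo(NamedTuple):
    client_ip: str
    client_port: int


def parse_proxy_v1(first_line: bytes) -> Optional[ProxyInfo]:
    """Parse a PROXY protocol v1 header line; return None if it is not one."""
    text = first_line.rstrip(b"\r\n").decode("ascii", errors="replace")
    parts = text.split(" ")
    # Expected shape: PROXY TCP4|TCP6 <src ip> <dst ip> <src port> <dst port>
    if len(parts) != 6 or parts[0] != "PROXY" or parts[1] not in ("TCP4", "TCP6"):
        return None
    src_ip, src_port = parts[2], parts[4]
    if not src_port.isdigit():
        return None
    return ProxyInfo(src_ip, int(src_port))


# Hypothetical addresses, for illustration only.
tcp_peer = "192.0.2.10"  # source of the TCP connection: the load balancer
header = b"PROXY TCP4 203.0.113.7 192.0.2.10 51234 443\r\n"

info = parse_proxy_v1(header)
print("firewall sees:", tcp_peer)
print("header-aware server sees:", info.client_ip if info else tcp_peer)
```

A firewall rule matches on the source address of the connection, so in this sketch it could only ever match `tcp_peer`, never the address carried in the header.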

So, yeah. What a mess. And the messiest part of all was that just a few days before coming to the terrible, nauseating realization that I picked the wrong approach to load balancing, I emailed several hundred members to say “new IP address, who dis” and asked everyone to drop what they were doing and modify their DNS settings to accommodate my change. The full weight of what I had done, and what I would have to do next, hit me in a really unpleasant way.

But all of the self-pity and self-loathing in the world wasn’t going to fix the problem. So I got to work on that.

Take 2: Doin’ It Right

With an insane amount of help from @jack, who is a literal genius about stuff like this, we went back to the drawing board. And after more planning and testing, we came up with this amazing and ridiculously improved setup:

  1. Instead of an automated Hetzner-issued load balancer, we switched to a separate box that runs Caddy. Caddy handles all of the magic of certificate management (in a way that really does feel like actual magic).

  2. Instead of sticking with the same legacy Apache setup for the edge nodes, which relied on centrally managed certificate issuance and distribution and a complicated dynamic virtual host setup, we went with Nginx configured in an incredibly light and simple way.

  3. Instead of using a fixed IP address on the load balancer, I procured a floating IP address (so the next time things need to change, everyone’s DNS records can be left alone).

This new setup is incredible. It’s faster, cleaner, and even more resilient than before. No more weird proxy protocol stuff — just simple traffic between load balancer and edge, all flowing through a secure private network. And most importantly, I can actually do stuff on the load balancer — like block unwanted traffic — which I couldn’t do on the purpose-built Hetzner box.

To be clear, this setup still runs in Hetzner’s datacenters. Hetzner is awesome! There was nothing wrong with their load balancer offering — it just wasn’t the right approach for omg.lol, and I failed to recognize that before choosing to use it.

Next Steps

At this point, everything is up and running with the new new setup. There may still be a few lingering bugs (I think there were some lingering after the first migration, and now with two migrations back-to-back anything is possible). I’ll be working on identifying those rough spots and sanding them down as quickly as I can.

I’ll also soon send an email that I’ve been dreading:

I can’t believe I have to do this again

That’s right... because of this change, if you have an external domain pointed at omg.lol, you’ll need to change the IP address to 5.78.24.5. The prior IP address still works because of a temporary setup I have in place, but you’ll want to move things over as soon as you can.
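If you want to double-check your domain after making the change, something like the following Python sketch could do it. It assumes your domain uses a plain IPv4 (A) record; the function names are invented for this example, and `example.com` is a stand-in for your own domain.

```python
import socket
from typing import Set

NEW_IP = "5.78.24.5"  # the address given in this post


def resolve_ipv4(hostname: str) -> Set[str]:
    """Return the IPv4 addresses the hostname currently resolves to."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, (address, port)).
    return {info[4][0] for info in infos}


def needs_update(addresses: Set[str], expected: str = NEW_IP) -> bool:
    """True unless the domain resolves to exactly the expected address."""
    return addresses != {expected}


# Usage (performs a real DNS lookup, so it is left commented out here):
# print(needs_update(resolve_ipv4("example.com")))
```

Keep in mind that DNS changes can take a while to propagate, so a stale answer right after an update doesn't necessarily mean something is wrong.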

So, yeah, that was a lot. I don’t know if any of this was overly interesting to anyone, but the entire experience was highly educational (and humbling) for me. Mostly I’m just really happy to put Infrastructure Month behind me so I can finally get back to the fun work of making omg.lol more awesome. I feel really bad about the bumps over the past few weeks, including the final bump of another IP address change so close to the last one. But I can confidently say that this really is it: with this new setup, we’ll be good to go for a very long time!

Anyway, back to the bugfixes and new features. Talk to you soon!

— Adam
