2023-06-23

Friday, June 23

I DID IT. I BOOKED A TRIP TO EGYPT. HOLY FUCKING SHITBALLS I'M GOING WITH BEN NEXT YEAR.

OH. MY. GOD. THIS. IS. HAPPENING.


Wrote a post in Discord today about the nightmare that was last night.

Mate it was a particularly awesome series of events. I haven’t been responsible/on the front line for an outage this bad for a longggg time.

We have been doing some pseudo-firmware level stuff for a client along with general application uplift. I'm currently a stand-in CTO for this client whilst he rebuilds his team. This client does digital signage, and has a cool editor thing. It's like Photoshop for digital signage, but it's really just a pretty React app. Lots of my client's customers use it internally for advertising etc. The way it works is basically:

1. Deploy a player runtime to a screen.
2. Register the screen with the central system. This is basically the client assigning a licence to a screen.
3. Create media content and assign it to the screen (a "playlist").
4. The screen downloads and caches the content, then uses the playlist manifest to determine when and what to play.

The "deploy a player runtime" is a sometimes-compiled, sometimes-generated interpreted artefact, i.e. it can be an HTML file, an Android app, or a custom Linux image compiled for whatever instruction set the screen uses (normally ARMv7). We currently compile out to 8 separate targets. Remember this for later.

When you create a playlist you can schedule it, e.g. "at 5pm on a Friday, display this thing until midnight Saturday".
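For my own future reference, the manifest shape is roughly this - field names are my guesses from memory, not the client's actual schema:

```typescript
// Rough sketch of a playlist manifest (hypothetical field names).
interface MediaItem {
  url: string;        // CDN path the player downloads and caches
  durationMs: number; // how long to show it for
}

interface ScheduleWindow {
  startMs: number; // epoch millis, e.g. Friday 17:00
  endMs: number;   // e.g. midnight Saturday
}

interface PlaylistManifest {
  screenId: string; // the licensed screen this playlist is assigned to
  items: MediaItem[];
  schedule: ScheduleWindow[];
}

// The player caches everything up front, then checks the schedule on a loop
// to decide when to actually show it.
function isActive(w: ScheduleWindow, nowMs: number = Date.now()): boolean {
  return nowMs >= w.startMs && nowMs < w.endMs;
}
```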

The client recently won a drive-through contract. Essentially their signage technology was going to have to accept real-time orders and display them on a panel when a customer orders food. I built this for them, tested it extensively, and it worked great. We've had it in UAT for 2? or so months. At the same time we were uplifting their production environment. Their prod runs on a single EC2 running Docker Engine 18.4 or something, in a single docker-compose process 🤦‍♂️ . The way their CD works is basically: it builds the images and generates a new docker-compose file with the container UUIDs.

All internal service routing is done via an nginx reverse proxy sitting at the front of the compose "cluster". Essentially all of the services are routed like this: host/api/<service>. All of these routes are baked into the downstream "player runtime" apps, so we can't just transparently change them.
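The "baked in" part is the killer. Something like this lives in every compiled artefact (service names made up by me):

```typescript
// Hypothetical, but this is the gist: the proxy paths are compile-time
// constants in the player runtime, so changing nginx's routing would
// orphan every screen already in the field.
const CONTENT_API = "https://prod-host/api/content";

async function fetchPlaylist(screenId: string): Promise<unknown> {
  const res = await fetch(`${CONTENT_API}/playlists/${screenId}`);
  if (!res.ok) throw new Error(`playlist fetch failed: HTTP ${res.status}`);
  return res.json();
}
```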

We have a nice new Kubernetes setup that has taken us nearly 3 months to build. It all works flawlessly, and finally lets us get detailed runtime logs etc. This was scheduled to be deployed this week, but the drive-through client (a massive client, it basically 5x's their revenue next FY) moved the deployment date up by a week, i.e. to yesterday.

This compose runtime has essentially no logs nor APM. We were so close to deploying the new version that we didn't bother updating the current one. Remember, their client moved the date forward a week. No worries: we retrofitted our new realtime stuff into their existing CI pipeline, knowing it would only be there for a few days whilst we finished e2e testing on our fandangled new k8s, and life would be good.

Here is where shit hits the fucking fan.

In the UAT branch there was some code that was committed 6? months ago but never released to prod. This was before we came onboard, and was basically some dropkick's "security test" requirement. Essentially, because the s3 bucket where content is stored allows public-read, these morons said it was insecure. Keep in mind this media gets downloaded and literally played to the public. To make it "secure", some code had been committed that attached the following query params to every NEW file uploaded to s3 via the app, and saved the result in the DB:

some-cloudfront-host.images.client.com/media/cat.jpg?AWSAccessKey=XXXXXXXXXXXX&ExpiryTime=99999999999 (expires 2038)

Remember, the security change added the ?AWSAccessKey=XXXXXXXXXXXX&ExpiryTime=99999999999 part to the filename that wasn’t there before.
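Paraphrasing the offending commit from memory (this is a sketch using the AWS SDK v2 for JS; the helper and function names are mine, not theirs):

```typescript
import { S3 } from "aws-sdk"; // AWS SDK for JavaScript v2

const s3 = new S3();

// Stand-in for the real DB write.
async function persistMediaPath(url: string): Promise<void> {
  console.log("saving media path:", url);
}

async function recordUploadedMedia(bucket: string, key: string): Promise<string> {
  // Long-lived signed GET URL - this is where the ?...&...= query string
  // gets attached to the file name.
  const signedUrl = s3.getSignedUrl("getObject", {
    Bucket: bucket,
    Key: key,
    Expires: 60 * 60 * 24 * 365 * 15, // seconds; effectively "expires 2038"
  });
  // The landmine: the FULL URL, query string and all, is what gets persisted
  // and later handed to the player as the file name.
  await persistMediaPath(signedUrl);
  return signedUrl;
}
```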

When the media is uploaded it sets a permissions policy on the object. When the playlist is dispatched to the screens it uses the path and token saved in the DB, so it comes out exactly like the file name above. The player runtime then downloads and caches the file pre-emptively, using the file name as the key.

This is where the issue is. We compile out to 8 different OSes. These OSes run on hundreds of versions of hardware and firmware. None of them are the same. And it turns out that a lot of them don't like the characters ?, & and = in file paths. But it also turns out a lot of them don't care.
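In player terms the bug and the fix are basically this (a sketch in Node-flavoured TypeScript, not the real player code - the actual runtimes are HTML/Android/Linux):

```typescript
import { createHash } from "node:crypto";
import { writeFile } from "node:fs/promises";
import * as path from "node:path";

// What the player effectively does: take the last URL segment verbatim
// as the on-disk cache key.
function naiveCacheKey(url: string): string {
  return url.substring(url.lastIndexOf("/") + 1);
  // -> "cat.jpg?AWSAccessKey=XXXX&ExpiryTime=99999999999"
}

// What it should do: derive a key with no reserved characters at all.
function safeCacheKey(url: string): string {
  return createHash("sha256").update(url).digest("hex");
}

async function cacheMedia(dir: string, url: string, bytes: Buffer): Promise<void> {
  // On firmware that rejects ?, & or = in paths this write throws;
  // on all 22 of our lab configurations it happens to succeed.
  await writeFile(path.join(dir, naiveCacheKey(url)), bytes);
}
```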

We have a remote lab set up with 22 screens, all running different hardware, firmware and OSes. And every single one of those configurations didn't care about the cache key. The code that had been sitting in UAT for 6 months threw no errors on our extensive hardware test suite.

We also deployed our UAT to production on Sunday morning and had no issues until Friday lunch. Why? Well, it turns out people schedule content for the weekend at around 11am Australian time on a Friday. All of the new content had been changed from cloudfront.com/media/image.png to cloudfront.com/media/image.png?AWSAccessKey=… but was left "dormant" until this weekend's specials etc. We went from 0 issues to thousands in the space of about 45 minutes, all the while all of our stuff had been working perfectly for a week.

To make it worse, in Australia most of this is done via hardware resellers. The reseller installs and supports the client's screens, and just uses this application as a backend. This means we have no device diagnostics to look at. Further, the compiled artefacts are somewhat limited in their reporting currently - it's on our roadmap to get through this, but we haven't had a chance to implement anything yet.

So here we are:

- we've tested for months and everything is perfect
- we have 22 different configurations pointed at prod with no errors
- we have no reporting or diagnostic data we can answer with
- all of the support requests came through like a giant flood

There are other, more subtle parts to this nightmare. For example, the player runtime error shown to a user was “Unable to download content”. Except it could download the content, it just couldn’t save it to cache. And since we had no diagnostic information we could only go off this misleading error 😬
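Note for the post mortem: the fix for the misleading error is just splitting the two failure modes apart. A minimal sketch, assuming a Node 18+ style global fetch (function names are mine):

```typescript
import { createHash } from "node:crypto";
import { writeFile } from "node:fs/promises";
import * as path from "node:path";

async function fetchAndCache(url: string, dir: string): Promise<void> {
  let bytes: ArrayBuffer;
  try {
    const res = await fetch(url);
    if (!res.ok) throw new Error(`HTTP ${res.status}`);
    bytes = await res.arrayBuffer();
  } catch (e) {
    // The old catch-all message - only accurate for THIS branch.
    throw new Error(`Unable to download content: ${e}`);
  }
  try {
    const key = createHash("sha256").update(url).digest("hex");
    await writeFile(path.join(dir, key), Buffer.from(bytes));
  } catch (e) {
    // This is the branch we were actually dying in.
    throw new Error(`Unable to cache content: ${e}`);
  }
}
```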

I've just realised I've written the bones of my post mortem before I get chewed the fuck out on Monday by the client 😄

To enter the above I actually used vim's visual block mode for the first time - C-v S-} S-i > then Esc - which is a crazy cool way to indent text like that 🤙. Thanks kind internet stranger.


Went to Shawn's bar and hated it. Fuck that place man - service was bad although I love the waitress. I ended up going a bit crazy and getting in an Uber with some dude named Felix. I brought 10 grams of coke and gave him 5 like a fucking animal.

reeee.