Dev Deletes Entire Production Database, Chaos Ensues
If you’re tasked with deleting a database, make sure you delete the right one.
Sources:
Notes:
1:05 - The middle bullet point, about the account that had 47,000 IPs, was never mentioned in the postmortem (there was an initial report on the day of the incident and a more detailed postmortem a bit over a week later). Perhaps it was a red herring that the team later realized didn’t matter.
3:07 - I made the error say “too many open connections” since that’s easier to understand than semaphores.
3:39 - This part was confusing, since the postmortem and the initial report conflicted. The postmortem said the engineers believed pg_basebackup was failing because previous attempts had created some files in the data directory, but the initial report said the theory was that the data directory existed (despite being empty). Either way, the engineers wanted to delete the data directory, but the exact reason is unclear.
4:37 - They probably didn’t check for backups in this order. I’m sure team-member-1 immediately called out that he had taken a backup 6 hours earlier, and then they just had to verify the other backups in case there was a better one.
6:21 - Being reported by a troll does not automatically remove a user; it only flags the account for manual review. The account was then incorrectly deleted after that review.
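The advice at the top (make sure you delete the right one) can be sketched as a small guard. This is only a minimal Python sketch and is not taken from the postmortem: the names remove_data_dir and WrongHostError, and the idea of comparing hostnames before deleting, are hypothetical and used purely for illustration.

```python
import shutil
import socket
import tempfile
from pathlib import Path


class WrongHostError(RuntimeError):
    """Raised when a destructive command runs on an unexpected machine."""


def remove_data_dir(data_dir: str, expected_host: str) -> None:
    """Delete data_dir only if this machine is the one the operator intended.

    Hypothetical helper: the hostname check is an assumed safeguard,
    not something described in the incident reports.
    """
    actual_host = socket.gethostname()
    if actual_host != expected_host:
        # Refuse before touching anything on disk.
        raise WrongHostError(
            f"refusing to delete {data_dir}: running on {actual_host!r}, "
            f"expected {expected_host!r}"
        )
    shutil.rmtree(data_dir)


if __name__ == "__main__":
    # Demo on a throwaway directory with a deliberately wrong hostname.
    scratch = tempfile.mkdtemp()
    try:
        remove_data_dir(scratch, expected_host="some-other-host.invalid")
    except WrongHostError as err:
        print(err)
    print("still exists:", Path(scratch).exists())
    shutil.rmtree(scratch)  # clean up the throwaway directory
```

The point of the design is that the operator has to type the name of the machine they think they are on, so a terminal open on the wrong server fails loudly instead of deleting data.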
Chapters:
0:00 Seconds before disaster
0:16 Part 1: Database issues
2:21 Part 2: The rm -rf moment
4:32 Part 3: Restore from backup
6:13 Part 4: Post incident discoveries
7:27 Lessons learned
9:46 The fate of team-member-1
10:11 ???
Music:
- Thriller Trailer Teaser Tense by Cold Cinema
- Finding the Balance by Kevin MacLeod
- Eyes Gone Wrong by Kevin MacLeod
- Desert City by Kevin MacLeod
- Jane Street by TrackTribe