How to Lose 50 Million Records in 5 Minutes

YOUTUBE Qbxmf_TxA-s Jon Druse presents at RubyConf 2019

Druse gives a colourful talk about how, through a series of decisions, his team lost all their data.

This talk seems related to another talk on Stability Patterns & Antipatterns.

12:05 The path to ruin, and how the mistake ended up being made.

How you can ruin any application: decide you need to make a change, then ignore the test suite. Now you're stuck in a status quo where fixing things feels too expensive or time-consuming. Problems arise because of that poor status quo, so you reach for quick fixes. You don't research the problem to its depth; you're just guessing. Enough guesses eventually culminate in a last-ditch effort, which may turn into disaster.

Their root problem was using timestamps as keys in their JSON records, which doesn't fit how Elasticsearch thinks about data: dynamic mapping creates a new field for every unique key, so the index mapping grows without bound.
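
The talk doesn't show the actual documents, but a minimal sketch of the shape of the problem might look like this (field names and values are my own illustration):

```ruby
require "json"

# Anti-pattern: readings keyed by timestamp. Elasticsearch's dynamic
# mapping creates a brand-new field for every unique JSON key, so each
# new timestamp grows the index mapping forever ("mapping explosion").
keyed_by_timestamp = {
  "2019-11-19T10:00:00Z" => { "value" => 42.0 },
  "2019-11-19T10:05:00Z" => { "value" => 43.5 }
}

# Better fit: a fixed schema where the timestamp is a value, not a key.
# Every document has the same fields no matter how many readings it holds.
fixed_schema = {
  "readings" => [
    { "timestamp" => "2019-11-19T10:00:00Z", "value" => 42.0 },
    { "timestamp" => "2019-11-19T10:05:00Z", "value" => 43.5 }
  ]
}

puts JSON.pretty_generate(keyed_by_timestamp)
puts JSON.pretty_generate(fixed_schema)
```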

How to handle catastrophe: Stay calm, Work the problem, Have backups

"Work the problem people, not make things worse by guessing". When you guess, you're just grabbing at straws but you have no idea what you're doing. This comes from the movie Apollo 13.

24:01 Things to do to avoid catastrophe

Take responsibility for the things you work on. Viewing things as somebody else's problem won't get you far.

Follow best practices. Most of our systems are layers that organize other systems; you really have to listen to how those systems ask to be used, or they won't behave as you expect.
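
One concrete way to "listen" to Elasticsearch, for example, is to declare the mapping explicitly instead of relying on dynamic defaults. A sketch, with the index and field names purely illustrative:

```ruby
require "elasticsearch"

client = Elasticsearch::Client.new(
  url: ENV.fetch("ELASTICSEARCH_URL", "http://localhost:9200")
)

# Declare the schema up front and have Elasticsearch reject documents
# that don't match it, rather than letting dynamic mapping quietly
# create a new field for every unexpected key.
client.indices.create(
  index: "readings",
  body: {
    mappings: {
      dynamic: "strict", # unknown fields raise an error instead of being auto-mapped
      properties: {
        timestamp: { type: "date" },
        value:     { type: "float" }
      }
    }
  }
)
```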

Investigate root causes. If you get a weird error and just swing at it with whatever you have, it might or might not work. If it does work, that's even more dangerous: confirmation bias sets in, and you never find out that something is still wrong underneath.

You will never use your program less than you do right now. Most businesses and processes are about growth, which means these assets need to last.