Let’s talk about downtime

The last few days

After moving everything to a new server, I discovered a corrupted value in a collection: whenever Mongo tried to scan it, mongod crashed. This surfaced while backups of the database were being run. Since the corrupted value isn’t touched by normal usage (opening graphs on the website), it’s likely some piece of graph data that’s over a month old. There’s over a year’s worth of graph data stored in Mongo currently.

The error when Mongo crashed and burned:


2016-01-04T21:40:45.904-0500 E STORAGE [initandlisten] WiredTiger (0) [1451961645:904694][23909:0x7fb200659bc0], file:collection-4--2802777555639207910.wt, cursor.next: snappy error: snappy_decompress: SNAPPY_INVALID_INPUT: 1
2016-01-04T21:40:45.904-0500 E STORAGE [initandlisten] WiredTiger (0) [1451961645:904744][23909:0x7fb200659bc0], file:collection-4--2802777555639207910.wt, cursor.next: file:collection-4--2802777555639207910.wt: encountered an illegal file format or internal value
2016-01-04T21:40:45.904-0500 E STORAGE [initandlisten] WiredTiger (-31804) [1451961645:904752][23909:0x7fb200659bc0], file:collection-4--2802777555639207910.wt, cursor.next: the process must exit and restart: WT_PANIC: WiredTiger library panic
2016-01-04T21:40:45.904-0500 I - [initandlisten] Fatal Assertion 28558
2016-01-04T21:40:45.913-0500 I CONTROL [initandlisten]
0xf81182 0xf1e409 0xf02606 0xda3561 0x13d71dc 0x13d738d 0x13d7804 0x133facb 0x1344688 0x1341263 0x1355a6e 0x1356472 0x1327cad 0x13750ec 0xd920fb 0xd921b3 0xa07068 0xbe2252 0x932d60 0xab7764 0x80bc25 0x7d8619 0x7fb1fec1dec5 0x8085ec

Yikes.

Thinking it might help, I tried db.repairDatabase(), which only made things worse: Mongo then refused to start at all, because on startup it attempted to rebuild the index, scanned over the corrupted field, and immediately crashed. After some searching, I came across this SO post, which was exactly what I wanted: something to strip away the corrupted fields but leave everything else intact.

Installing WiredTiger + Snappy on Ubuntu is quite straightforward: download WiredTiger; apt-get install libsnappy-dev; compile WiredTiger with --enable-snappy. Then I could run salvage on the affected WiredTiger db file:


# LD_LIBRARY_PATH=/usr/local/lib ./wt -C "extensions=[libwiredtiger_snappy.so]" salvage collection-4--2802777555639207910.wt
[1451962717:703933][9707:0x7fdb1331f740], file:collection-4--2802777555639207910.wt, WT_SESSION.salvage: snappy error: snappy_decompress: SNAPPY_INVALID_INPUT: 1
[1451963170:615578][9707:0x7fdb1331f740], file:collection-4--2802777555639207910.wt, WT_SESSION.salvage: snappy error: snappy_decompress: SNAPPY_INVALID_INPUT: 1
[1451963645:419521][9707:0x7fdb1331f740], file:collection-4--2802777555639207910.wt, WT_SESSION.salvage: snappy error: snappy_decompress: SNAPPY_INVALID_INPUT: 1
[1451964144:694852][9707:0x7fdb1331f740], file:collection-4--2802777555639207910.wt, WT_SESSION.salvage: snappy error: snappy_decompress: SNAPPY_INVALID_INPUT: 1
#

Restarted Mongo and it successfully reindexed! The corrupted fields are gone, and everyone is happy again.

Also, earlier in the week some very spiky graphs were being generated. The cause was a misconfigured MariaDB server: it was running with mostly default InnoDB settings, which made graphs take 30 minutes to generate instead of 20 seconds … whoops on my part. It was a good reminder that every custom kernel and MariaDB option MCStats uses is incredibly important for achieving this level of performance on a very low budget.
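The post doesn’t say which settings were wrong, so purely as an illustration: these are the kinds of InnoDB knobs whose defaults commonly cause exactly this sort of slowdown. The values below are placeholders, not MCStats’ actual configuration.

```ini
# my.cnf — illustrative InnoDB tuning; NOT the actual MCStats configuration.
[mysqld]
innodb_buffer_pool_size        = 4G      # default is only 128M; fit the working set in RAM
innodb_log_file_size           = 512M    # larger redo logs smooth out heavy write bursts
innodb_flush_log_at_trx_commit = 2       # trade ~1s of durability for much higher throughput
innodb_flush_method            = O_DIRECT # skip double-buffering through the OS page cache
```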

The last few months

The server would sporadically become overloaded -> tens of thousands of connections would get stuck waiting -> OVH would worsen the situation by mistaking it for an attack -> a hardware reboot was required before I could even SSH in.

Unfortunately, during this time I didn’t have time to properly investigate, so all I could do was make sure the server came back up as soon as possible (and even then, I was sometimes unavailable for several days at a time). The best I know is that it never happened before the move to a Docker-based setup.

After moving back to a more traditional setup (managed with Ansible), it hasn’t crashed once in the last week, whereas before it would usually happen every day or two. Of course, if there’s an underlying issue it could happen again, but if it does I’ll be able to investigate properly this time around.

Going forward

I would still like to complete the work I started on the ng branch of telemetry-server. The ng branch as it stands is largely experimentation, so I see it as a prototype for the ideas I’ve tried.

Right now, a lot of the overhead in graph generation comes from assigning IDs to columns in a unique <Plugin, Graph, Column> tuple (where a Column is a line on a graph, e.g. “Players” for a Global Stats graph). This ID is completely useless, yet it incurs a lot of overhead during generation (creating the ID if it doesn’t exist; caching all known columns; mapping name -> id) and when viewing graphs on the website (mapping id -> name). This gets expensive for graphs like the version graphs, Server Locations, etc., which have many, many columns to load. Removing the ID makes storing and retrieving data incredibly simple, and the extra storage overhead has tested out to be near nil thanks to WiredTiger’s compression.
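The actual schema isn’t shown in the post, so here is a minimal sketch of the two approaches with invented field names, just to make the difference concrete:

```python
# Sketch of the two storage schemes for a <Plugin, Graph, Column> data point.
# All names and document shapes here are hypothetical; the post doesn't show
# telemetry-server's real schema.

# Scheme 1: ID-mapped columns (what the ng branch removes). Every write needs
# a lookup (or creation) in a name -> id table, and the website needs the
# reverse id -> name map to label graph lines.
column_ids = {}      # (plugin, graph, column_name) -> numeric id
column_names = {}    # numeric id -> column_name, for the website

def record_with_id(plugin, graph, column, value, store):
    key = (plugin, graph, column)
    if key not in column_ids:          # create-if-missing round trip
        new_id = len(column_ids) + 1
        column_ids[key] = new_id
        column_names[new_id] = column
    store.append({"col": column_ids[key], "v": value})

# Scheme 2: store the column name directly. No id tables and no mapping at
# read time; block compression absorbs the repeated strings on disk.
def record_direct(plugin, graph, column, value, store):
    store.append({"plugin": plugin, "graph": graph, "col": column, "v": value})

id_store, direct_store = [], []
record_with_id("Essentials", "Global Statistics", "Players", 1234, id_store)
record_direct("Essentials", "Global Statistics", "Players", 1234, direct_store)
```

Scheme 2 trades a little raw document size for the removal of every mapping step on both the write and read paths.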

Other improvements I’d like to make include making the server more resilient to restarts (all data is held in memory until generation happens, so a restart destroys up to the last 30 minutes of data) as well as keeping a better archive of generated data.
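One common way to get that kind of restart resilience (a sketch of the general technique, not telemetry-server’s actual design) is to journal incoming points to an append-only file and replay it on startup:

```python
import json
import os
import tempfile

# Minimal append-only journal so a restarted process can recover data points
# that were buffered in memory. Purely illustrative; the names and on-disk
# format are invented for this sketch.

class Journal:
    def __init__(self, path):
        self.path = path
        self.f = open(path, "a")

    def append(self, point):
        # One JSON document per line; flush so a crash loses at most
        # the point currently being written.
        self.f.write(json.dumps(point) + "\n")
        self.f.flush()

    def replay(self):
        # On startup, rebuild the in-memory buffer from the journal.
        with open(self.path) as f:
            return [json.loads(line) for line in f]

    def truncate(self):
        # Call after a successful generation: the data is durable elsewhere.
        self.f.close()
        open(self.path, "w").close()
        self.f = open(self.path, "a")

path = os.path.join(tempfile.mkdtemp(), "journal.log")
j = Journal(path)
j.append({"graph": "Global Statistics", "col": "Players", "v": 1234})
recovered = j.replay()   # what a restarted server would see
```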
