The graph database for MCStats (using MongoDB) now contains over 1 billion documents. The 1 billionth document was inserted on January 29, 2016!
I’ve been working on getting the data into Google BigQuery so that I can explore it far more freely than I can with MongoDB. After a day of massaging the data, I was able to load it successfully!
Loading the data is very much a manual process. The data covers 2014-11-13 to now. November 2014 is when I started keeping _all_ data again, after moving away from DigitalOcean to a host where it is more feasible to keep everything instead of purging.
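To give a rough idea of what "massaging the data" can involve, here is a minimal sketch, not the actual export script. BigQuery can load newline-delimited JSON, so one approach is to turn each MongoDB document into one JSON line. The field names (`plugin`, `epoch`, `sum`) and the helper names are assumptions made up for illustration.

```python
import json
from datetime import datetime, timezone


def to_bigquery_row(doc):
    """Convert one MongoDB-style document (a dict) into a JSON-friendly dict."""
    row = {}
    for key, value in doc.items():
        if key == "_id":
            # ObjectId values are not JSON-serialisable, so keep them as strings.
            row["id"] = str(value)
        elif isinstance(value, datetime):
            # Assumption: naive datetimes are stored as UTC.
            if value.tzinfo is None:
                value = value.replace(tzinfo=timezone.utc)
            row[key] = value.isoformat()
        else:
            row[key] = value
    return row


def to_ndjson(docs):
    """Serialise documents as newline-delimited JSON, one document per line."""
    return "".join(
        json.dumps(to_bigquery_row(doc), sort_keys=True) + "\n" for doc in docs
    )


if __name__ == "__main__":
    # Hypothetical graph data point; the real schema may differ.
    sample = [{"_id": "abc123", "plugin": 1, "epoch": datetime(2014, 11, 13), "sum": 10}]
    print(to_ndjson(sample), end="")
```

The resulting file can then be handed to whichever BigQuery load path is in use; the schema still has to be declared or detected on the BigQuery side.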
The last few days
After moving everything to a new server, I discovered a corrupted value in one collection: whenever Mongo tried to scan it, mongod crashed. This was found while backups of the database were being run. Since normal usage (opening graphs on the website) never touched the corrupted value, it is likely a piece of graph data that’s over a month old. There’s currently over a year’s worth of graph data stored in Mongo.
A few weeks ago MCStats completed its migration to DigitalOcean.
Moving away from a single dedicated server to multiple smaller “cloud” servers allows MCStats to run much more smoothly and quickly. The SSDs DigitalOcean offers allowed this to happen without any issues.
MCStats has used PHP to generate graphs since it was created. The first “backend” (the endpoint servers send data to) was also in PHP, but it was soon rewritten in Java once in-process caching became desperately needed.
For the past few months graph generation has been taking longer and longer, to the point where it took ~15 minutes to generate all of the graphs for every plugin. For the graphs to generate at all, the backend needs to keep MySQL up to date, and keeping it very up to date requires some 2,000+ queries/sec constantly. On a single machine, with MySQL kept small, this overwhelmed MySQL and response times went through the roof: the queries that aggregate data for the graphs were already expensive, and they took even longer alongside the massive number of queries the backend was issuing.
So what was the main step towards fixing this?
No data has been generated for the last week, for a few reasons:
- All graph data has been moved from MySQL (500+ million rows) to MongoDB (the actual conversion only took a couple of hours, though).
- At the same time I physically moved to a different machine. This was primarily to get out of OVH’s lacklustre network in Montréal. MCStats is now located in New York.
- Naturally, Real Life got even busier so what should’ve only taken a day had to be postponed.
Now that the site is back up, what is new?
- Improved page caching. Most pages (and all API requests) are now cached in Redis. Previously I was using Memcached to cache specific pages, but Memcached appeared to have a mind of its own (or it was the PHP driver), so I took the chance to switch to Redis at the same time. Redis has worked as expected the entire time.
- Graph generator progress bars have made their way back onto the site.
- http://api.mcstats.org/api/1.0/ is now more simply http://api.mcstats.org/1.0/.
- The backend now supports compression+JSON requests. The new R7 reporter will take advantage of this once released.
- Graphs for Plugin Rank have been added (finally). This will be more interesting once the plugin index is finished but it’s there 🙂
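The page caching mentioned in the list above follows a common pattern: look the page up in the cache, and only render it when it is missing. Here is a minimal sketch of that idea, not the site's actual code (which is PHP). `cached_page`, the key name and the TTL are assumptions for illustration, and `FakeRedis` is an in-memory stand-in that mimics only the `get` and `setex` calls a Redis client such as redis-py provides.

```python
class FakeRedis:
    """In-memory stand-in exposing only get() and setex(); it ignores expiry."""

    def __init__(self):
        self.store = {}

    def get(self, name):
        return self.store.get(name)

    def setex(self, name, time, value):
        # A real Redis server would drop the key after `time` seconds.
        self.store[name] = value


def cached_page(client, key, ttl_seconds, render):
    """Return the cached page for `key`, rendering and storing it on a miss."""
    cached = client.get(key)
    if cached is not None:
        # Real Redis clients usually return bytes unless configured to decode.
        return cached.decode("utf-8") if isinstance(cached, bytes) else cached
    body = render()
    client.setex(key, ttl_seconds, body)
    return body


if __name__ == "__main__":
    client = FakeRedis()

    def render_plugin_list():
        # Stands in for an expensive page render.
        return "<html>plugin list</html>"

    # Hypothetical cache key and a 60 second lifetime.
    print(cached_page(client, "page:/plugin-list", 60, render_plugin_list))
```

Passing the client in as a parameter keeps the function testable without a running Redis server; with a real client, the expiry time bounds how stale a cached page can get.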