Behind the Scenes of a Critical Event ~ Why We Had to Roll Back



    Thursday, October 10th, 12:12 pm – Lunchbreak – Travian Games HQ


    It’s a peaceful day in the Travian Games HQ in Munich. Some employees are having lunch, some are silently working, some are playing Travian: Legends or just chitchatting with colleagues.


    But all of a sudden something goes wrong.


    Our Product Owner Brian receives an unusual message while selling an item in an auction on Travian: Legends. The phone of Martina, our Community Communication Manager, goes crazy with a multitude of WhatsApp messages from her alliance. Our Slack channel, where we communicate with the Legends on Tour Ambassadors, starts beeping, and at the same time our Facebook inbox gets alarmingly full.


    It doesn’t take much to connect the dots. Something serious is happening and we are ready to kick-start our crisis management process. We run to our developers and that’s where it all began.


    But let’s go back in time for a second…


    On Wednesday, the 9th of October, our standard deletion action (inactive accounts, deleted accounts, etc.), which is triggered every second, stopped working properly, and incomplete deletions were detected. A bug task was created immediately, and a hotfix to tackle the issue was implemented and tested. Standard process, nothing wrong there.


    But there was something we could not foresee…


    Instead of processing deletion requests regularly, our system got bombarded every second with a growing list of deletion requests. This continuous flood of requests apparently resulted in the corruption of the data.


    What does it mean exactly?


    UserIDs of the accounts that should have been deleted were scrambled into something else. The total number of accounts was still valid, but the actual IDs of the accounts were wrong.


    Our deletion routine is indeed one of our older systems, as we mentioned in one of our Ask Travian episodes, but so far it has been very reliable. Of course, after what happened we are already looking into ways of improving it.


    Now back to the 10th of October, at 12:12 pm.


    The hotfix mentioned above was deployed around noon and made the system work again at exactly 12:12 pm, allowing the execution of all the piled-up deletion requests.


    Unfortunately, as we explained, those UserIDs were corrupted.


    In total, thousands of accounts were affected on 170 out of our 184 gameworlds and many of those deleted accounts belonged to active players.


    The first 70 gameworlds we announced on Facebook and on our Forum were identified immediately because the Natars were missing, but an additional 50 gameworlds were added later by manually checking each database. The rest had to be analyzed case by case, since some gameworlds had just a handful of deletions that could not justify a rollback.


    If you were wondering why we couldn’t share all the details from the very beginning, that’s exactly the reason. Updates were coming directly from our dev team, so we could only share live news through our channels. Nothing was set in stone.


    And trust me, you don’t want to constantly bother someone who is trying to put out a fire…


    As you probably know, rolling back a server is not just a matter of pressing a button. It takes several hours of work and there are tons of tiny details to take into consideration. And multiply that by 120 servers.


    It is indeed a lot of work that needs to be prioritized and executed.


    To speed up the rollback process, we decided to dump the affected databases of the 120 gameworlds instead of writing backups and replace them immediately with the restored databases from the latest backups we had. The standard rollback process would have taken up to 7 days but of course, we did not even consider that option. We had to act fast.


    We run daily backups, which means that in total we lost 12-36 hours of data for each gameworld, depending on the date and time of the backup.
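
    To make that range concrete, here is a small sketch in Python of how the lost window could be worked out for a single gameworld. The function name, the year 2019, and the two backup times are assumptions made up for illustration; only the daily backup schedule and the 12-36 hour range come from the account above.

```python
from datetime import datetime


def hours_lost(last_backup: datetime, maintenance_start: datetime) -> float:
    """Return the hours of progress lost when restoring to last_backup."""
    if last_backup > maintenance_start:
        raise ValueError("the backup must be older than the maintenance start")
    return (maintenance_start - last_backup).total_seconds() / 3600


# Hypothetical timestamps: the year and the backup times are illustrative only.
maintenance_start = datetime(2019, 10, 10, 12, 12)
lucky_world = hours_lost(datetime(2019, 10, 10, 0, 12), maintenance_start)
unlucky_world = hours_lost(datetime(2019, 10, 9, 0, 12), maintenance_start)
print(lucky_world, unlucky_world)
```

    With these made-up times, a gameworld backed up shortly after midnight on Thursday loses 12 hours, while one whose last good backup dates from the night before loses 36.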


    Yes, some gameworlds were luckier than others in that regard, but that’s why we wanted to speed up the entire process.


    On Thursday and on Friday our developers, system administrators, and everyone involved in supporting Travian: Legends literally worked day and night to retrieve all backups and communicate the current status to players.


    At 11:35 am on Friday, all backups were finally restored and it was time to schedule restarts. To keep you updated as much as possible, we set up a spreadsheet where we published the estimated restart time. To save time, we decided to update it live, as soon as we got news from our devs.


    The order of the domains was completely dependent on availability. No domains were favored, in case you were wondering.


    Our goal was to bring the first batch back online around 2 pm and we managed to achieve it. Then the rollback was performed on one batch after the other, and finally, around 8 pm all servers were back online.


    But it was not over. All Gold payments made between the backup date and the time when the gameworld went into maintenance mode needed to be rebooked, and finally compensation had to be granted.
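
    The rebooking step boils down to a time-window filter. Below is a minimal sketch in Python, assuming a hypothetical Payment record and made-up timestamps and amounts; it is not the actual billing code, just an illustration of which purchases fall into the gap between backup and maintenance.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Payment:
    """Hypothetical payment record, invented for this illustration."""
    player_id: int
    gold: int
    paid_at: datetime


def payments_to_rebook(payments, backup_at, maintenance_at):
    """Select payments made after the backup and up to the maintenance start."""
    return [p for p in payments if backup_at < p.paid_at <= maintenance_at]


# All values below are made up for the example.
backup_at = datetime(2019, 10, 9, 3, 0)
maintenance_at = datetime(2019, 10, 10, 13, 0)
payments = [
    Payment(1, 100, datetime(2019, 10, 8, 20, 0)),    # before the backup: already in the restored data
    Payment(2, 250, datetime(2019, 10, 9, 18, 30)),   # inside the gap: needs rebooking
    Payment(3, 600, datetime(2019, 10, 10, 12, 30)),  # inside the gap: needs rebooking
]
lost = payments_to_rebook(payments, backup_at, maintenance_at)
print([p.player_id for p in lost])
```

    Purchases made before the backup are already part of the restored database, so only the ones inside the window have to be credited again.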


    And yes, we also want to address the hot topic of compensation. As it is impossible to satisfy the needs of each and every single player, we were anticipating some level of dissatisfaction, but at the same time we wanted to offer compensation at the largest possible scale.


    We are very aware of the harm a rollback does to you as players and that is why, every time something goes wrong, we try to avoid it as much as possible.


    Please keep in mind that not only does it harm you as a player, but it also harms us as a company. You might think that giving out Gold as compensation has no impact on us since it is virtual currency we introduced in the game, but the value of Gold is indeed connected to the service we provide of keeping a live game up and running. It is a big loss that we cannot take lightly. Every decision we make takes a multitude of factors into consideration and players are always our number one priority.


    Unfortunately, we cannot make up for the time you lost in the game. We are aware that for a lot of you, this is not “just a game”. And that’s exactly why with this blog post we wanted to give you our side of the story. Our goal is to be as transparent as possible.


    Of course, we are very sorry about the frustration and anger this incident may have caused.


    We are continuously working on improving various processes and we hope you are still going to follow the exciting Travian adventures that are coming up.


    Thank you for reading this post and, even more, a big thank you from the Travian: Legends team for all the support you showed during this critical event!


    You gave us the energy to face this challenge… even with some funny memes!



    Members of the Travian Team work on a voluntary basis and are therefore not available 24 hours a day.