Website Outage - What Happened?


haloman30

Hey, everyone! As you can see, our website is back online!

 

Some of you may (or may not) have noticed that the website was offline for a couple of days. It all kicked off with an attempted hardware upgrade on the server. If you're interested in the details, keep on reading.

 

How it Started

The hardware upgrade in question was a graphics card. It was intended to give the Jenkins server more flexibility when building certain projects - namely, Sandbox. Godot 3.x requires that a GPU be present when building under Windows, so unless we wanted to upgrade to Godot 4, we needed graphics acceleration. Part of this process was migrating Jenkins from its own dedicated server to a virtual machine (VM) under our hypervisor, which in our case is a separate server running Proxmox. The existing install was migrated from the original server to its new VM without any issue - but once we shut everything down to install the graphics card, it became clear that the card was a bit too large and would not physically fit inside the server.

 

As such, we closed the server back up, with plans to add a different graphics card later on down the line. Ideally, that would've been the end of it - just a brief, 30ish minute outage, and nothing more. Unfortunately, while all the other VMs booted up without issue, one of them didn't - and you can probably guess which one it was.

 

Upon realizing the website wasn't accessible, we checked on the VM and found it was failing to boot from its virtual hard disk. Further investigation showed that the partition table on the disk was seemingly corrupted - every tool we tried believed the disk was completely unallocated.
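
For the curious, here's roughly what "completely unallocated" looks like at the byte level. This is only an illustrative sketch in Python, not the tooling we actually used: the function name `describe_partition_table` and the 512-byte sector size are assumptions made for the example, and it expects a raw disk image rather than any particular VM disk format. An MBR disk ends its first sector with the bytes 0x55 0xAA, and a GPT disk starts its second sector with the ASCII text "EFI PART". If neither signature is there, most software will treat the disk as blank.

```python
import io

SECTOR_SIZE = 512  # assumed logical sector size; some disks use 4096


def describe_partition_table(disk):
    """Report which partition table signature a raw disk image carries.

    `disk` is any seekable binary file-like object (hypothetical helper,
    written for illustration only).
    """
    disk.seek(0)
    data = disk.read(SECTOR_SIZE * 2)
    sector0 = data[:SECTOR_SIZE]
    sector1 = data[SECTOR_SIZE:]

    # MBR (and the protective MBR on GPT disks) ends sector 0 with 55 AA.
    has_mbr = len(sector0) == SECTOR_SIZE and sector0[510:512] == b"\x55\xaa"
    # The GPT header begins at sector 1 with the signature "EFI PART".
    has_gpt = sector1[:8] == b"EFI PART"

    if has_gpt:
        return "gpt"
    if has_mbr:
        return "mbr"
    return "none"


# Demo with in-memory images instead of a real disk.
blank = io.BytesIO(bytes(SECTOR_SIZE * 2))
mbr_only = io.BytesIO(bytes(510) + b"\x55\xaa" + bytes(SECTOR_SIZE))
print(describe_partition_table(blank))
print(describe_partition_table(mbr_only))
```

A result of "none" only says the signatures are missing. The partitions' data can still be sitting on the disk untouched, which is why recovery was worth attempting at all.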

 

The Recovery Process

We began by trying to restore the original CentOS 7 installation, which would have avoided any need to reconfigure or reinstall anything. After several attempts we had some degree of success in restoring the partition table, and got somewhat close - it just wasn't quite enough. We were able to make some partitions readable again, but we couldn't get the disk bootable. We also tried restoring two different Proxmox VM backups, both to no avail. Whatever went wrong, it didn't happen on the day of the maintenance - it had happened at least a few days earlier, and potentially longer ago than that.

 

Given this, and given that CentOS 7 reaches end-of-life in June 2024, we decided to try a different approach: simply reinstalling the OS entirely. Even if we managed to get the original install back, there was no guarantee we'd get it working entirely correctly. And since we had no real idea what caused the issue to begin with, it was also possible that it could repeat itself. To top it all off, we'd still have to reinstall the OS in around 6 months' time anyway.

 

This was not a simple task either - it took a few attempts with different combinations of operating system and cPanel/WHM version before we found one that worked properly. Last night, we finally got things up and running as desired, and overnight we reimported the automatic backup that WHM had created of the cPanel account - which worked almost perfectly out of the box.

 

The Result

After some further reconfiguration, almost everything is exactly as it was. There should be essentially zero data loss, as the backup was from the same day that we performed the original hardware maintenance. Maybe a few hours of lost registrations or logs - but that's it.

 

However, there is one thing some of you will notice no longer works: some of our older archives, primarily those from Chaotic United.

 

Any archives of the CU main website or forums are no longer functional, as they require PHP 5.6 - which has been end-of-life for nearly 5 years. The old main website ran under IP.Board 3.4.x, as did the old forum archives and a few other things here and there.

 

Everything that required an older version of PHP will now display a 403 page, and it will likely remain this way for some time. In the future, we plan to set up a dedicated server, fully isolated from everything else, specifically to keep these older websites up and running. We don't have a timeframe for when this will happen, as I'm currently quite busy with work and preoccupied with other projects (primarily Sandbox and Blamite) - but you can rest assured that we still have all of the data perfectly intact, and that at some point down the line, we'll be bringing this stuff back online.

 

In terms of non-archival stuff, however - you shouldn't notice any issues, and everything should be exactly as it was before this whole mess.

 

 

To wrap up - if you happen to find anything that doesn't seem quite right, be sure to report it on our bug tracker, or let us know on our Discord server - and we'll investigate as soon as we're able. For now, though, that's all we've got. We apologize for this downtime, and we hope that getting things up and running on a fresh VM, with a newer OS, will prevent something like this from happening again.




