What Happened
On Friday, July 8th, we started to see intermittent timeouts that were unclear at the time but, in retrospect, were early signs of a pending hardware failure. Over the next few days, these issues quietly worsened, and by Tuesday, July 12th, users began to experience more widespread error pages and timeouts. Upon further investigation, we learned that multiple pieces of hardware were failing simultaneously, an exceptionally rare combination of failures that we could not have predicted. To make matters worse, the replacement equipment we ordered for overnight shipping and Thursday delivery was held up at FedEx until Monday. Talk about bad luck!
What We Did to Solve the Problem
Late in the week, it became apparent that we needed to replace the failing hardware. However, that process takes considerably longer while users are still accessing the site, so we made the difficult decision to put the site into maintenance mode and limit user access. To minimize customer impact, we scheduled the work during off hours: Thursday evening, and Friday evening through Monday morning. We also took steps to communicate the planned maintenance windows to our customers through multiple channels and to keep you updated on our progress.
The maintenance windows gave us the best chance of limiting further customer impact by giving our team as much uninterrupted time as possible to make the repairs. While the site was in maintenance mode, we replaced the failing hardware, tested our disaster recovery processes, and rebuilt the affected databases. Because we maintain multiple systems for backing up customer data and logging changes, the hardware failure did not corrupt or lose any data in your backups. We know a few of you ran into issues on Monday, and we implemented fixes that evening. If you are still experiencing any problems, long lag times, or errors, please report them immediately in our chat channel or at helpdesk@share-builders.com.
What We're Doing to Make Sure It Doesn't Happen Again
First, while our early warning systems did indicate that trouble was brewing, these monitoring processes can always be improved, so we're making changes to get that information into the hands of the people who need it as soon as possible. We are also building an automated process that regularly scans every database for signs of data corruption. The goal is that if hardware monitoring misses the cause of a problem, early detection in the data itself serves as a second, and sometimes earlier, check.
Lastly, and this is the big one: this entire experience made it clear that we must accelerate our timeline for moving from our current data center to the cloud. This move has been on our roadmap for months, but we had deprioritized it in favor of releasing new features that would be more visible to the everyday user of ShareBuilder CRM. Moving to the cloud is like getting new car tires: you have to do it eventually, and your car performs better when you do, but it's not the kind of investment anyone gets excited about or even notices.
Like any company, we have limited resources, so prioritizing the cloud move means pushing some planned features further down our timeline. We expect the delay to be about a month, so it's not a significant shift, and it will let us deliver better features down the line while keeping our focus on what's new rather than on hardware.
Click here to see our roadmap for what's to come over the next few months.
We hope this gives you a complete picture of last week's incident, but if you have any questions or concerns, please reach out at helpdesk@share-builders.com. Thank you for your time and your partnership.