What the heck happened?

What happened that knocked our server offline?

      On Sunday (4/30) around 2 PM Central I was paged that one of my servers was behaving erratically and rebooting itself.  I checked on it and did not find anything glaringly wrong in the log files but continued to monitor the situation.  I also contacted my server management company to have a look at the situation.  They did not respond for several hours.

     I continued to watch the server and restart failing services as I tried to track down the root cause of the problem.  It appeared that someone was trying to hack into the server using well-known Windows exploits, which are not at all useful against the Linux machines that we run, but somehow the login attempts and the reboots seemed to be connected.

     The server managers finally replied around 5 or 6 PM.  They took over the operation and I stepped back, confident that the issue would be solved.  After a while they gave up and told me to contact the datacenter.

     After more diagnostics I determined that the hard drive on the server was dying.  I contacted my datacenter and asked them to put a new hard drive in the server with the old one as a secondary so I could get the data off without resorting to the backups, which would have been several hours out of date.  I finally went to bed around 4 AM on Monday morning.

     At 6 AM I was back up and headed to the shop.  I had not received a reply from the datacenter, but I discovered that my office phones were out of commission.  You may have seen something on CNN about a huge storm in Texas that knocked planes into each other in Gainesville last Saturday night - that's just north of here.  I know that some of you tried to call early Monday morning because your calls were routed to my cellphone but I'm sure others were dropped due to that localized outage.

     The datacenter took their time putting the new hard drive in.  They determined that the drive was in fact faulty and so was the power supply.  Both were replaced, but it was well after 2 PM before I was able to access the server again.  I began copying files over from the old drive to the new and setting up the server.  I determined quickly that #1, the old hard drive was dying fast, and #2, the new hard drive was not set up properly.  There were some files from version 3 of the operating system and some from version 4.  I could not get the server functional.

     Rather than wait for the datacenter techs to attempt another reload, I rented a KVM box from them that allowed me remote access to the server almost as if I were sitting in front of it.  I thought this would be a quick thing when I filled in the form.  After some miscommunication and outright screwups on their part, I was finally able to access the server completely between 11 PM and midnight on Monday 5/1.  I installed the operating system correctly and began uploading backups from my backup server.  With nothing to do but wait and watch the uploads go, I left the office at 4:30 Tuesday (5/2) morning.  I was walking back in the front door by 7.

    Tuesday was spent configuring the server, uploading more backups, restoring user data, etc.  99% of the accounts on the server were operational by the time I left Tuesday night between 11 PM and midnight, the only exception being one very large account that was still uploading.  I set a task on a timer to restore his account when the upload finished and went home to get some sleep.

     So, what caused this and how will it be prevented in the future?

    The root cause was hardware failure, which is just a fact of life - it happens.  Sometimes drives fail after a few months and sometimes they last for decades.  Compounding factors were slow response times from my datacenter and server management company, and miscommunication all around.

     Effective as soon as possible, all Blue Note servers will have backup hard drives in a mirrored RAID configuration.  If one drive fails, the other will keep running in its place and the faulty drive can be replaced when it's convenient.  In addition, we have contracted with a different, hopefully better, certainly more expensive server management company to monitor the servers and watch for early warning signs of this or other issues.

     All clients hosted on the beethoven server will receive one month of hosting free of charge.  If you have multiple packages with us, this applies to all packages hosted on the beethoven server.  I personally went into the billing system yesterday and adjusted everyone's renew date to be one month later.

     I will be personally visiting each of your sites to look for any problems related to the outage and will correct them at no cost to you.  Those of you with online stores will be first on the list, ordered by the date you signed up with my company.  Please understand that most of my clients on this server have online stores and it's going to be a busy week.

     I value your business and would be humbled if you chose to continue as my clients.  Some of you have been with me from the start and I know you by name if not by face - but I would certainly understand if you chose to take your business elsewhere.  An outage like this is unacceptable in the web hosting business.  Despite the upgrades that I have planned, I do not plan to increase prices for my current clients.

    If you have any questions or concerns, please do not hesitate to use the contact form on this page or email me directly.  As always, thanks for your business.