Getting a £60M/year ecommerce site back to life in 10 minutes

Real life story

You get back home after a good day at work; it’s Christmas time so business is doing well, servers are getting busy, but everything seems normal, working as expected (let’s say… inside the OK thresholds). But just when you are relaxed at home, maybe preparing some food, just having a beer, you hear that scary sound: our friend Nagios, the systems overlord.

You grab your phone, hoping it’s just some service getting a bit busier than expected (OK, it’s Xmas! It can happen!) but you see something like: CRITICAL – your image servers are losing 60% of their packets on the network. PANIC. You think it cannot be possible, those servers were coping fine with the traffic, maybe it’s just the connection to the monitoring server, so you connect to the website and… MORE PANIC. Every page takes more than 10 seconds to load and doesn’t even have all its images; Chrome is displaying some funny icon instead of that pair of jeans on that gorgeous model.

So first, you check Analytics… Yeah, we are having some big traffic, but is it that bad? While you try to work out what is happening, the face of your CEO pops into your head, armed with a shotgun: he will definitely kill me, and if I survive, I will need a new job.

Quick check on the monitoring graphs: something is going on in the network, but it doesn’t seem to be that bad… Let’s call the Data Centre: oh, yes, the firewall is dying, and that old forgotten server is dying too (let’s face it, you had forgotten that old server, it was serving just images!!). The provider might be able to provision a new firewall, but it won’t be ready for another hour or more, and replacing it will be a contractual nightmare.

You have the Head of Web Development on the line; he is ready to move the images to another domain served by the main servers, but it will definitely cause more problems. We are running out of time, Xmas is here and the CEO is loading shells into his shotgun, so we switch it quickly. The site is back but definitely slower, as we are grabbing power from the app servers just for serving images, putting pressure on the rest of the network, SAN and firewalls. Did I say that it was the biggest day of the year? Maybe the biggest day in the company’s history? That we needed maximum speed? Did I mention the shotgun?

So, let’s call the sky, Cloud Power!!!

Amazon saved my life: let’s create a new Amazon CloudFront distribution, and configure the DNS… Come on CloudFront!! 5 minutes?!?! Really?? Just for setting up a complete ad-hoc CDN, deployed on around 30 POPs around the world? 5 minutes!! Yes, I’m a speed addict, but, thinking a bit more about it, 5 minutes is OK for setting all that up :).
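For the curious, here is roughly what "create a new CloudFront distribution" could look like if scripted with boto3 instead of clicked through in the console. This is only a minimal sketch, not what we actually ran that night: the function names, the origin hostname and the TTL are made up for illustration, it uses the legacy `ForwardedValues` style of cache behaviour, and the exact set of required fields has changed between API versions, so check the current reference before relying on it.

```python
import time


def build_image_cdn_config(origin_domain, caller_reference=None):
    """Build a DistributionConfig dict for a simple image CDN.

    origin_domain is the server that still holds the images
    (hypothetical value, e.g. "images-origin.example.com").
    """
    origin_id = "image-origin"  # arbitrary label, made up for this sketch
    return {
        # Must be unique per request so retries are not treated as new distributions.
        "CallerReference": caller_reference or f"image-cdn-{int(time.time())}",
        "Comment": "Emergency CDN for product images",
        "Enabled": True,
        "Origins": {
            "Quantity": 1,
            "Items": [
                {
                    "Id": origin_id,
                    "DomainName": origin_domain,
                    "CustomOriginConfig": {
                        "HTTPPort": 80,
                        "HTTPSPort": 443,
                        "OriginProtocolPolicy": "http-only",
                    },
                }
            ],
        },
        "DefaultCacheBehavior": {
            "TargetOriginId": origin_id,
            "ViewerProtocolPolicy": "allow-all",
            # Images do not depend on query strings or cookies, so forward none.
            "ForwardedValues": {
                "QueryString": False,
                "Cookies": {"Forward": "none"},
            },
            "MinTTL": 3600,  # assumed value: keep images at the edge for an hour
        },
    }


def create_image_cdn(cloudfront_client, origin_domain):
    """Create the distribution and return its *.cloudfront.net hostname.

    cloudfront_client is expected to be boto3.client("cloudfront").
    """
    response = cloudfront_client.create_distribution(
        DistributionConfig=build_image_cdn_config(origin_domain)
    )
    return response["Distribution"]["DomainName"]
```

The hostname it returns is what the images DNS record (a CNAME) would point at, once the distribution has finished deploying.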

All looks good. Hey, Head of Dev, the CDN is now ready :D, can you run some quick checks? All looks good to him, so, please, send it to production!!

After a minute, we are serving our images from the CDN, the cache seems to be working nicely, let’s give it some minutes so it gets warm… We experience an incredible boost in speed on the site! When you view 200 products per page, the product images now load even quicker than the rest of the site!! This looks very good.
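If you want something a bit more solid than "seems to be working", one rough way to see whether the cache is getting warm is to look at the `X-Cache` response header that CloudFront adds ("Hit from cloudfront", "Miss from cloudfront" and so on) and work out the share of hits. A small sketch, assuming you have already collected those header values from a handful of image requests; the function name is made up for this post:

```python
def cache_hit_ratio(x_cache_values):
    """Return the fraction of responses served from the edge cache.

    x_cache_values is a list of X-Cache header values, for example
    "Hit from cloudfront" or "Miss from cloudfront". "RefreshHit" is
    counted as a hit too, since the object was already at the edge.
    """
    values = [value.strip().lower() for value in x_cache_values]
    if not values:
        return 0.0  # nothing sampled yet, so nothing to claim
    hits = sum(1 for value in values if value.startswith(("hit", "refreshhit")))
    return hits / len(values)


sample = [
    "Miss from cloudfront",
    "Hit from cloudfront",
    "Hit from cloudfront",
    "RefreshHit from cloudfront",
]
print(cache_hit_ratio(sample))  # 0.75
```

Right after the switch you would expect mostly misses, with the ratio climbing as the edge locations fill up.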

Hearing the Head of Dev singing and saying “Ohh!! That’s f*** fast!! Look at that!!” is almost priceless, so your day is heading for a good ending; tomorrow, instead of starting the search for a new job, you will be like “Hey! Had fun last night?! Thanks guys!” :). So, you send some humble, low-profile email to the CEO, saying something like “Oh, yes, I was just passing by and as the server was not coping as expected (ehem… burning like hell), we have deployed a complete world-wide CDN solution. Oh, yes, it’s OK, normal everyday Sysadmin job, nothing to worry about” and you have just saved your life and your £60M business, and avoided that awful email asking for explanations of what happened last night and what solutions you are providing to avoid this kind of issue next time.

Now, it’s time to grab another beer (or maybe some single malt scotch?) and say “Prost! Thank you Amazon, I love The Cloud!!”, and start realizing that you have made an impressive infrastructure change in just 10 minutes.

Just another normal day at the Systems Department.

Cloud is here, don’t be afraid of it, it will probably speed up your business, save money, and keep Sysadmins (and CEOs) happy.

 

Cloud & Bolt image from University of New England, Maine.