4 Sherpas took an adventure to Toronto for the Full Stack Toronto Conference 2015; these are some of the things they learned...
In the financial world, when you have clients that depend on you to handle their transactions, downtime is lost money. Not the proverbial “lost money” – actual money that was supposed to go to your client and never did, because you weren’t there to make it happen.
High availability is a measure of how rarely your downtime happens. Don’t get me wrong, downtime is important; that’s where good things happen for your system – updates sometimes need to interrupt your services. But, and I mean BUT – downtime is also caused by things you don’t control.
Tonight’s Main Event! Chaos vs. Infrastructure!!!

Just like in the real world, infrastructure changes, needs repair, and sometimes calls for demolition and rerouting. What worked for many years might not be the most efficient option today. After weighing the options, changes happen; but those changes can only take place if you take your services down. The idea is to keep that downtime as short as possible.
And then Chaos.
Recently, an Instagram engineer reported that he now knows Justin Bieber’s account number by heart. That’s the less important part of the story – the engineer knows the number because that account is the root cause of several instances of insanely high site/service traffic. JB posts a new picture and immediately millions of his followers make that picture their most favourite thing in the whole wide world. Good for JB, bad for Instagram users, since it basically makes the site’s data transfer crawl or, worse, takes the site down entirely. The effect is the same as a Distributed Denial of Service – a common form of “attack” used to take websites down – except here it happens organically, with nothing more than a huge number of legitimate users.
Dressing to the 9’s.
A site or service’s uptime is classified as a percentage of availability over the year, month, week, and day.
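To make those numbers concrete, here is a back-of-the-envelope sketch (my own, not from the talk) of how much downtime each commonly quoted “number of nines” leaves you in a year:

```python
# Rough sketch: the yearly downtime budget for common availability targets.
MINUTES_PER_YEAR = 365 * 24 * 60

targets = {
    "two nines (99%)": 99.0,
    "three nines (99.9%)": 99.9,
    "four nines (99.99%)": 99.99,
    "five nines (99.999%)": 99.999,
}

for label, availability in targets.items():
    allowed_downtime = MINUTES_PER_YEAR * (1 - availability / 100)
    print(f"{label}: roughly {allowed_downtime:.1f} minutes of downtime per year")
```

Dressing to the 9’s really means chasing those last fractions of a percent – the jump from 99.9% to 99.99% shrinks your yearly downtime budget from about nine hours to under an hour.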

A website is not unlike a restaurant: when it gets popular, it gets more traffic. When customers try to visit you and provide you with revenue (or simply visit for their own reasons) but can’t get through the door, or get poor service, expect a poor Yelp review.
People rarely say to their friends “I always go to this restaurant, and it is always amazing…,” but the second they experience something wrong, they make sure they tell everyone they know. As we now know, a company’s image is its lifeblood.
Preparing for High Availability ensures that your site or service is ready for anything and helps protect your company’s image from becoming tarnished.
Capacity planning is first and foremost the best starting point. Implement load balancing: a simple process of ensuring that not all traffic is pushed onto one server. When your site metrics show that usage is growing, it is time to add another server and spread the load. Spread the wealth!
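As a rough illustration of the idea (my own sketch, not any particular vendor’s setup – the hostnames are made up), round-robin load balancing can be as simple as rotating incoming requests through a pool of interchangeable servers:

```python
# Minimal round-robin load balancing sketch: each request is handed to the
# next server in the rotation, so no single box carries all the traffic.
from itertools import cycle

backends = cycle([
    "http://app-server-1.internal",
    "http://app-server-2.internal",
    "http://app-server-3.internal",
])

def route(request_path: str) -> str:
    """Pick the next backend in the rotation for an incoming request."""
    backend = next(backends)
    return f"{backend}{request_path}"

# Four requests land on three different servers, then the rotation wraps around.
for _ in range(4):
    print(route("/checkout"))
```

In practice this job is handled by dedicated software or hardware (HAProxy, nginx, a cloud load balancer), but the principle is the same: spread the wealth.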
Braintree uses load balancing as its first line of defense. It also USED TO cheat when it came to HA – I don’t know what they really do now; they’re still secretive about their internal proxies and such – but it used to be the illusion of uptime.
All transactions would be stored in a queue so that if the service went down, the transactions could still appear to take place. The illusion of the transaction would be maintained and a response returned to the requestor; then, when the service came back, the queue would be drained and the transactions completed one after another. We’re talking microseconds here, and if it took any longer, the site would just seem slow – whereas it was actually unavailable. You just wouldn’t know.
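Here is a rough sketch of that queue-and-drain pattern as I understood it – my own illustration, not Braintree’s actual implementation; the health check and processing calls are placeholders:

```python
# Sketch of the "illusion of uptime": always answer the requestor right away,
# park transactions in a queue while the processor is down, and drain the
# queue once it comes back.
from collections import deque

pending = deque()

def processor_is_up() -> bool:
    # Placeholder health check; a real system would probe the actual service.
    return False

def process(transaction: dict) -> None:
    # Placeholder for the real transaction-processing call.
    print(f"processed {transaction['id']}")

def accept_transaction(transaction: dict) -> str:
    """Respond immediately, whether or not the processor is available."""
    if processor_is_up():
        process(transaction)
    else:
        pending.append(transaction)   # park it until the service returns
    return "accepted"                 # the requestor never sees the outage

def drain_queue() -> None:
    """Run when the service recovers: complete queued transactions in order."""
    while pending and processor_is_up():
        process(pending.popleft())

print(accept_transaction({"id": "txn-001", "amount": 25.00}))
drain_queue()  # does nothing here, since the placeholder says the processor is down
```

The design choice is simple: the caller gets an instant answer either way, and the only symptom of an outage is a slightly delayed settlement.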
Even though they have ditched this approach to transaction management, it is still a viable one. If a transaction can be queued, why not do so and let the website continue being a website instead of a data-processing station?