Number 9, Number 9, Number 9 (Evolving High Availability at Braintree by Lionel Barrow)

Four Sherpas took an adventure to Toronto for the Full Stack Toronto Conference 2015. These are some of the things they learned...




In the financial world, when you have clients that depend on you to handle their transactions, downtime is lost money. Not the proverbial “lost money”, but actual money that was supposed to go to your client and didn’t, because you couldn’t make the transaction happen.

High availability is a measure of how much of the time your service is up or, put the other way, how rarely downtime happens. Don’t get me wrong, downtime is important; that’s where good things take place for your system, since some updates need to interrupt your services. But, and I mean BUT, downtime is also caused by things you don’t control.

 

Tonight’s Main Event!
Chaos vs. Infrastructure!!!

/_uploads/images/contenthub-posts/08-2017/3E47FAF1EBFDE34D21D25198596C3E0126F3963AA4A4EB6FF7pimgpsh_fullsize_distr.jpg

 

Just like in the real world, infrastructure changes, needs repair, and sometimes has to be torn down and rerouted.
What worked for many years might not be the most efficient option today. After weighing the options, you make changes; but some of these changes can only take place if you take your services down. The idea is to take your services down for as little time as possible.

And then Chaos.

Recently, an Instagram engineer reported that he now knows Justin Bieber’s account number by heart. That is the less important part of the report. The engineer knows the number because this account is the root cause of several instances of insanely high site traffic. JB posts a new picture and immediately millions of his followers make that picture their most favourite thing in the whole wide world. Good for JB, bad for Instagram users, since it slows the site’s data transfer to a crawl or, worse, takes the site down entirely. The effect resembles a Distributed Denial of Service (DDoS) attack, a common way of taking websites down by flooding them with traffic, except that here the flood comes from a large number of legitimate users.

Dressing to the 9’s.

A site’s or service’s uptime is expressed as a percentage of availability over a period: a year, month, week, or day. The more 9s in that percentage (99%, 99.9%, 99.99%, and so on), the less downtime is allowed.

 

/_uploads/images/contenthub-posts/08-2017/9s.png
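
To make the 9s concrete, here is a minimal sketch of the arithmetic behind them. The function name and the choice of a 365-day year are my own assumptions for illustration, not something from the talk.

```python
# Downtime budget for a given availability percentage.
# Assumes a 365-day year; the names here are illustrative only.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def allowed_downtime_seconds(availability_percent, period_seconds=SECONDS_PER_YEAR):
    """Return how many seconds of downtime fit in the period at this availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return period_seconds * (1 - availability_percent / 100)


if __name__ == "__main__":
    for percent in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_seconds(percent) / 60
        print(f"{percent}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

Each extra 9 cuts the yearly downtime budget by a factor of ten: roughly 3.65 days at 99%, under 9 hours at 99.9%, and only about 5 minutes at 99.999%.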

 

A website is not unlike a restaurant: when it gets popular, it gets more traffic. When customers try to visit you and provide you with revenue (or simply visit for their own reasons) but can’t get through the door, or get poor service, expect a poor Yelp review.

People rarely say to their friends “I always go to this restaurant, and it is always amazing…”, but the second they experience something wrong, they make sure to tell everyone they know. As we now know, a company’s image is its lifeblood.

Preparing for High Availability ensures that your site or service is ready for anything and can help protect your company image from becoming tarnished.

Capacity Planning is the best starting point. Implement Load Balancing: a simple way of ensuring that not all traffic is pushed onto one server. When your site metrics show that usage is growing, it is time to add another server to share the load. Spread the wealth!
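
As a rough illustration of the idea, here is a toy round-robin balancer. Round-robin is just one common strategy that I picked for the sketch; the class and the server names (`app-1` and so on) are hypothetical, and a real setup would use a dedicated load balancer rather than application code like this.

```python
class RoundRobinBalancer:
    """Hands out servers in turn so no single server takes all the traffic."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("at least one server is required")
        self._servers = list(servers)
        self._next = 0  # index of the next server to hand out

    def add_server(self, server):
        # Called when metrics show growth and a new server is brought online.
        self._servers.append(server)

    def pick(self):
        # Wrap around with modulo so the rotation keeps working after add_server().
        server = self._servers[self._next % len(self._servers)]
        self._next += 1
        return server


if __name__ == "__main__":
    balancer = RoundRobinBalancer(["app-1", "app-2"])
    print([balancer.pick() for _ in range(4)])
    balancer.add_server("app-3")
    print([balancer.pick() for _ in range(6)])
```

Once `app-3` joins the pool, it immediately starts taking its share of requests, which is the "spread the wealth" part.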

Braintree uses load balancing as their first line of defense. They also USED TO cheat when it came to HA. I don’t know what they really do now (it’s still secretive, with their internal proxies and such), but it used to be the illusion of uptime.

All transactions would be stored in a queue, so that if the service went down, transactions could still be accepted. The illusion of the transaction would still take place, and a response would go back to the requestor. Then, when the service came back, the queue would be drained and the transactions would be completed one after the other. We’re talking microseconds here, and if it took any longer, the site response would just seem slow, whereas the service was actually not available. You just wouldn’t know.

Even though they have ditched this approach to transaction management, it is still a viable one. If a transaction can be queued, why not do so and let the website continue being a website instead of a data processing station?
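
The pattern can be sketched in a few lines. To be clear, this is not Braintree’s code: the class, the method names, and the use of `ConnectionError` to signal a backend outage are all assumptions I made for the example, and a real system would need a durable queue rather than an in-memory one.

```python
from collections import deque


class QueuedGateway:
    """Accepts transactions right away and completes them once the backend is up."""

    def __init__(self, processor):
        # processor: a callable that completes one transaction and raises
        # ConnectionError while the backend is down (an assumption of this sketch).
        self._processor = processor
        self._pending = deque()
        self.completed = []

    def submit(self, transaction):
        # The requestor gets an answer immediately, whether or not the backend is up.
        self._pending.append(transaction)
        return "accepted"

    def drain(self):
        """Process queued transactions in order; return how many are still waiting."""
        while self._pending:
            transaction = self._pending[0]
            try:
                result = self._processor(transaction)
            except ConnectionError:
                break  # backend still down: keep the transaction queued, retry later
            self._pending.popleft()
            self.completed.append(result)
        return len(self._pending)
```

The website only ever calls `submit`, so it stays responsive during an outage; something else (a worker, a timer) calls `drain` until the queue is empty.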

