High Availability & Scaling

High Availability Defined

A characteristic of a system, which aims to ensure an agreed level of operational performance for a higher than normal period. -- Wikipedia

Three principles of HA:

  1. Elimination of single points of failure
  2. Reliable crossover
  3. Detection of failures as they occur


  • This means adding redundancy to the system so that failure of a component does not mean failure of the entire system.
  • In multi-threaded systems, the crossover point itself tends to become a single point of failure. High availability engineering must provide for reliable crossover.
  • If the two principles above are observed, then a user may never see a failure. But the maintenance activity must.

HA Terminology

What is Availability?

Probability that a system is operational at a given time generally given in percentage.

\[\frac { (\text{Time resource was available} - \text{Time resource was unavailable}) } { \text{Total Time} }\]
Ideal is typically five 9s, 99.999%
This gives less than fifty three minutes of downtime per year
A reasonably good goal is 99.9%
This allows for 100x more downtime than five 9s.

Measuring uptime/downtime is hard

Reasons for Un-Availability

Downtime Measuring Example

Consider the following scenario:


Redundancy is the duplication of critical components or functions of a system with the intention of increasing reliability of the system. -- Wikipedia

Redundancy is closely tied to reliability (more redundant systems usually have higher reliability).

Passive Redundancy
Used to achieve high availability by including enough excess capacity in the design to accommodate a performance decline.
Active Redundancy
Used in complex systems to achieve high availability with no performance decline.


Reliability can be defined as the probability that a system will produce correct outputs up to some given time. -- Wikipedia

Testing reliability:

Feature Testing
Checks the features provided by the software or system
Load Testing
Check the performance of the software or system under load
Regression Testing
Check to see if any new bugs have been introduced with previous bug fixes

Single Point of Failure

Traditionally a point with 0 redundancy, often instead means the point in the system with the lowest redundancy value.


Single Point of Failure

Identifying SPOFs is a hard task.

Many places will do fire drills, where a system in staging/pre-production is purposefully taken down so that failure scenarios can be observed, and single points of failure can be identified and fixed.

You can read more about Netflix does this wth Chaos Monkey.

Fault Tolerance

Fault tolerance is the property that enables a system to continue operating in the event of a fault happening.


Examples of HA Systems



You can define scaling as adding more resources to increase performance, reliability, or redundancy.

Two forms:


Horizontal Scaling

Adding more nodes to a system.

Also known as scaling out.


Horizontal Scaling

  • Typically has higher upper bound than vertical scaling
  • Can bring greater increases than vertical scaling
  • Redundancy
  • Expensive
  • Maybe not as much redundancy as you expect
  • Brings more complexity to manage
  • Unused capacity problems (pick: cost or even more complexity)

Horizontal Scaling Complexity

Horizontal scaling increases complexity because:

Vertical Scaling

Adding more resources to a particular node(s)

Also known as scaling up.


Vertical Scaling

  • Easier than horizontal scaling
  • No added complexity
  • Usually cheaper
  • No redundancy (but maybe more reliable)
  • Has a lower upper bound
  • Diminishing returns



Virtual IP

Virtual IP


  • Doesn't handle the replication of data
  • Can't move across subnets
  • Really only good for making an IP address(es) redundant
  • Sometimes ARP can bite you when moving the IP's around


A desirable property of a system which indicates its ability to either handle growing amounts of work in a graceful manner, or to be readily enlarged as demands increase. _images/scalability-comic.png

Truths about Scalability

  1. It won't scale if it's not designed to scale
  2. Even if its designed to scale, there's going to be pain

CAP Theorem

States that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:


CAP Theorem

7 Stages of Scaling Web Applications

  1. The Beginning
  2. More of the same, just bigger
  3. The Pain Begins
  4. The Pain Intensifies
  5. This Really Hurts!
  6. Getting (a little) less painful
  7. Entering the unknown..

Stage 1 -- The Beginning

Stage 2 -- More of the same, just bigger

Stage 3 -- The Pain Begins

Stage 4 -- The Pain Intensifies

Stage 5 -- This Really Hurts!

Stage 6 -- Getting (a little) less painful

Stage 7 -- Entering the unknown...

Where are the remaining bottlenecks?

Stage 7 -- Entering the unknown...

All eggs in one basket?

Good or Best Practices