# High Availability & Scaling

## High Availability Defined

A characteristic of a system that aims to ensure an agreed level of operational performance, usually uptime, for a higher than normal period. -- Wikipedia

Three principles of HA:

1. Elimination of single points of failure
2. Reliable crossover
3. Detection of failures as they occur

Note

• This means adding redundancy to the system so that failure of a component does not mean failure of the entire system.
• In redundant systems, the crossover point itself tends to become a single point of failure. High availability engineering must provide for reliable crossover.
• If the two principles above are observed, then a user may never see a failure. But the maintenance activity must still detect failures as they occur so they can be repaired.

## HA Terminology

• Availability
• Redundancy
• Reliability
• Single point of failure (SPOF)
• Fault Tolerance

## What is Availability?

The probability that a system is operational at a given time, generally expressed as a percentage.

$\frac{\text{Total Time} - \text{Time resource was unavailable}}{\text{Total Time}}$

The ideal is typically five 9s, 99.999%.
This gives less than five and a half minutes of downtime per year.
A reasonably good goal is 99.9%.
This allows for 100x more downtime than five 9s (roughly 8.8 hours per year).
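These downtime budgets can be checked with a few lines of arithmetic (a minimal sketch; the function name is illustrative):

```python
def downtime_per_year(availability_pct: float) -> float:
    """Return the allowed downtime, in minutes per year, for a given
    availability percentage."""
    minutes_per_year = 365.25 * 24 * 60  # ~525,960 minutes
    return minutes_per_year * (1 - availability_pct / 100)

print(f"five 9s  (99.999%): {downtime_per_year(99.999):.2f} min/year")  # ~5.26
print(f"three 9s (99.9%):   {downtime_per_year(99.9):.2f} min/year")    # ~525.96
```

Note that 99.9% allows exactly 100x the downtime of 99.999%, since the unavailable fraction (0.001 vs 0.00001) is 100x larger.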

Measuring uptime/downtime is hard

## Reasons for Un-Availability

• Physical hardware
• Network infrastructure
• Operating system
• Application
• Physical location
• Redundancy cross-over

## Downtime Measuring Example

Consider the following scenario:

• You run an OpenStack cluster. One day, your authentication API goes down.
• All of the customers' existing services continue running (VMs stay up, etc).
• What is the downtime?

## Redundancy

Redundancy is the duplication of critical components or functions of a system with the intention of increasing reliability of the system. -- Wikipedia

Redundancy is closely tied to reliability (more redundant systems usually have higher reliability).

• Passive Redundancy: used to achieve high availability by including enough excess capacity in the design to accommodate a performance decline.
• Active Redundancy: used in complex systems to achieve high availability with no performance decline.

## Reliability

Reliability can be defined as the probability that a system will produce correct outputs up to some given time. -- Wikipedia

Testing reliability:

• Feature Testing: checks the features provided by the software or system
• Load Testing: checks the performance of the software or system under load
• Regression Testing: checks whether any new bugs have been introduced by previous bug fixes

## Single Point of Failure

Traditionally, a single point of failure is a component with zero redundancy; in practice, the term often refers to the point in the system with the lowest redundancy.

Examples:

• Single load balancer with multiple web nodes
• Single database node
• Network switch
• Non-redundant power
• Power distribution and configuration
• Geo-location

## Single Point of Failure

Identifying SPOFs is a hard task.

Many places will do fire drills, where a system in staging/pre-production is purposefully taken down so that failure scenarios can be observed, and single points of failure can be identified and fixed.

## Fault Tolerance

Fault tolerance is the property that enables a system to continue operating in the event of a fault.

• Redundancy is a part of fault tolerance
• Redundancy generally refers to a component, while fault tolerance refers to a system-wide ability to deal with faults

Example:

• RAID is Fault Tolerant
• The hard drives are redundant

## Examples of HA Systems

• Resource Manager -- Pacemaker
• Messaging Layer -- Corosync or Heartbeat
• Resource Agents
• Data replication -- DRBD, GlusterFS, etc
• Database Replication
• Load balancers -- HAproxy, Varnish, etc
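The failure-detection and crossover idea behind these tools can be shown as a toy sketch. The backend names and the health probe below are hypothetical stand-ins for the real TCP/HTTP checks that tools like HAProxy or Pacemaker perform:

```python
def healthy_backends(backends, is_healthy):
    """Return only the backends that pass their health probe."""
    return [b for b in backends if is_healthy(b)]

def pick_backend(backends, is_healthy):
    """Route to the first healthy backend; fail over automatically."""
    alive = healthy_backends(backends, is_healthy)
    if not alive:
        raise RuntimeError("no healthy backends: total outage")
    return alive[0]

# Simulate node 'web1' failing: traffic crosses over to 'web2'.
status = {"web1": False, "web2": True, "web3": True}
print(pick_backend(["web1", "web2", "web3"], status.get))  # web2
```

The real tools add the hard parts this sketch omits: repeated probes with rise/fall thresholds, and making the health checker itself redundant so the crossover point is not a new SPOF.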

# Scaling

## Scaling

You can define scaling as adding more resources to increase performance, reliability, or redundancy.

Two forms:

• Horizontal
• Vertical

## Horizontal Scaling

Adding more nodes to a system.

Also known as scaling out.

Examples:

• Adding a second (or third, etc) database node

## Horizontal Scaling

Pros:
• Typically has higher upper bound than vertical scaling
• Can bring greater performance gains than vertical scaling
• Redundancy
Cons:
• Expensive
• Maybe not as much redundancy as you expect
• Brings more complexity to manage
• Unused capacity problems (pick: cost or even more complexity)

## Horizontal Scaling Complexity

Horizontal scaling increases complexity because:

• Requires load balancing, replication, etc
• Budgeting for peak load + X% can leave a lot of unused capacity
• Managing lots of nodes is harder than managing fewer nodes

## Vertical Scaling

Adding more resources (CPU, RAM, storage) to a particular node or set of nodes.

Also known as scaling up.

Examples:

• Upgrading a node with more RAM, faster CPUs, or faster storage

## Vertical Scaling

Pros:
• Easier than horizontal scaling
• Usually cheaper
Cons:
• No redundancy (but maybe more reliable)
• Has a lower upper bound
• Diminishing returns

## Virtual IP

• Doesn't correspond to a particular physical NIC
• Shared between many NICs across different machines (and one NIC can have multiple addresses)
• Can be moved to any other host on the same subnet
• Variety of implementations: keepalived, CARP, and ucarp

## Virtual IP

Limitations:

• Doesn't handle the replication of data
• Can't move across subnets
• Really only good for making an IP address redundant
• Sometimes stale ARP caches can bite you when moving the IPs around

## Scalability

A desirable property of a system which indicates its ability to either handle growing amounts of work in a graceful manner, or to be readily enlarged as demands increase.

1. It won't scale if it's not designed to scale
2. Even if it's designed to scale, there's going to be pain

## CAP Theorem

States that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:

• Consistency: All nodes see the same data at the same time
• Availability: A guarantee that every request receives a response about whether it succeeded or failed
• Partition Tolerance: the system continues to operate despite arbitrary partitioning due to network failures

PICK TWO

## 7 Stages of Scaling Web Applications

1. The Beginning
2. More of the same, just bigger
3. The Pain Begins
4. The Pain Intensifies
5. This Really Hurts!
6. Getting (a little) less painful
7. Entering the unknown...

## Stage 1 -- The Beginning

• Simple Architecture
• Pair of web servers
• Database Server
• Internal Storage
• Low complexity and overhead means quick development and lots of features, fast
• No redundancy, low operational cost -- great for startups

## Stage 2 -- More of the same, just bigger

• Business is becoming successful -- risk tolerance low
• Add more web servers for performance
• Scale up the database and optimize
• Still relatively simple from an application perspective

## Stage 3 -- The Pain Begins

• Publicity hits (Reddit, Hacker News, etc)
• Setup reverse caching proxies (Varnish) -- to cache static content
• Add even more web servers (Managing content becomes painful)
• Single database can't cut it anymore
• All writes go to a single master server with read-only slaves
• May require some re-coding of the application
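The re-coding usually means routing queries by type: writes to the single master, reads spread across the read-only slaves. A minimal sketch, where the connection names are hypothetical placeholders for real database connections:

```python
import random

# Hypothetical stand-ins for real database connections.
MASTER = "master-db"
READ_REPLICAS = ["replica-1", "replica-2"]

def connection_for(query: str) -> str:
    """Send writes to the single master; spread reads across read-only slaves."""
    is_write = query.lstrip().split()[0].upper() in {"INSERT", "UPDATE", "DELETE"}
    return MASTER if is_write else random.choice(READ_REPLICAS)

print(connection_for("INSERT INTO users VALUES (1)"))  # master-db
print(connection_for("SELECT * FROM users"))           # one of the replicas
```

The painful part is that reads on a replica can lag behind the master, so code that writes and immediately reads back may need to be pinned to the master.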

## Stage 4 -- The Pain Intensifies

• Caching with memcached
• Replication doesn't work for everything
• Single "writes" database
• Too many writes
• Replication takes too long
• Database partitioning starts to make sense
• Certain features get their own database
• Shared storage makes sense for content
• Requires significant re-architecting of the application and database
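The memcached caching mentioned above is usually the cache-aside pattern: check the cache, fall back to the database on a miss, then populate the cache. A minimal sketch, using a plain dict in place of a real memcached client and a hypothetical `load_user_from_db` loader:

```python
cache = {}  # stand-in for a real memcached client

def load_user_from_db(user_id):
    # Hypothetical expensive query against the "writes" database.
    return {"id": user_id, "name": f"user-{user_id}"}

def get_user(user_id):
    """Cache-aside: check the cache first, fall back to the DB, then populate."""
    key = f"user:{user_id}"
    if key in cache:
        return cache[key]                  # cache hit: no database load
    user = load_user_from_db(user_id)      # cache miss: hit the database
    cache[key] = user                      # populate so the next read is cheap
    return user

get_user(42)         # miss: loads from the DB
print(get_user(42))  # hit: served from the cache
```

A real deployment also needs expiry and invalidation on writes, which is where much of the Stage 4 pain comes from.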

## Stage 5 -- This Really Hurts!

• Panic sets in. Hasn't anyone done this before?
• Re-thinking entire application / business model
• Why didn't we architect this thing for scale?
• Can't just partition on features -- what else can we use?
• Partitioning based on geography, last name, user ID, etc
• Create user-clusters
• All features available on each user-cluster
• Use a hashing scheme or master DB for locating which user belongs to which cluster
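A simple version of the hashing scheme maps a user ID to a cluster number; the cluster count here is a hypothetical example:

```python
import hashlib

N_CLUSTERS = 4  # hypothetical number of user-clusters

def cluster_for(user_id: str) -> int:
    """Map a user to a cluster by hashing the ID. Every feature for that
    user lives on the returned cluster."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % N_CLUSTERS

print(cluster_for("alice"))  # stable: the same user always maps to the same cluster
```

The catch with plain modulo hashing is that changing `N_CLUSTERS` remaps most users, which is why the master lookup database (or consistent hashing) is the other option mentioned above.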

## Stage 6 -- Getting (a little) less painful

• Scalable application and database architecture
• Acceptable performance
• Starting to add new features again
• Optimizing some of the code
• Still growing, but it's manageable

## Stage 7 -- Entering the unknown...

Where are the remaining bottlenecks?

• Power, Space
• Bandwidth, CDN, Hosting provider big enough?
• Storage
• People and process
• Database technology limits -- scalable, key-value store anyone?

## Stage 7 -- Entering the unknown...

• Single datacenter
• Single instance of the data
• Difficult to replicate data and load balance geographically

## Good or Best Practices

• Don't re-invent the wheel, copy someone else
• Think simplicity
• Think horizontal, not vertical, on everything
• Use commodity equipment
• Make troubleshooting easy
• Don't spend your time over-optimizing