https://www.getfilecloud.com/blog/an-introduction-to-high-availability-architecture/

An introduction to High Availability Architecture

In the real world, there can be situations when a dip in performance of your servers might occur from events ranging from a sudden spike in traffic can lead to a sudden power outage. It can be much worse and your servers can be crippled- irrespective of whether your applications are hosted in the cloud or a physical machine. Such situations are unavoidable. However, rather than hoping that it doesn’t occur, what you should actually do is to gear up so that your systems don’t encounter failure.

The answer to the problem is the use of High Availability (HA) configuration or architecture. High availability architecture is an approach of defining the components, modules or implementation of services of a system which ensures optimal operational performance, even at times of high loads. Although there are no fixed rules of implementing HA systems, there are generally a few good practices that one must follow so that you gain the most out of the least resources.

Why do you need it?

Let us define downtime before we move further. Downtime is the period of time when your system (or network) is not available for use, or is unresponsive. Downtime can cause a company huge losses, as all their services are put on hold when their systems are down. In August 2013, Amazon went down for 15 minutes (both web and mobile services), and ended up losing over $66000 per minute. Those are huge numbers, even for a company of the size of Amazon.

There are two types of downtimes- scheduled and unscheduled. A scheduled downtime is a result of maintenance, which is unavoidable. This includes applying patches, updating softwares or even changes in the database schema. An unscheduled downtime is, however, caused by some unforeseen event, like hardware or software failure. This can happen due to power outages or failure of a component. Scheduled downtimes are generally excluded from performance calculations.

The prime objective of implementing High Availability architecture is to make sure you system or application is configured to handle different loads and different failures with minimal or no downtime. The are multiple components that help you in achieving this, and we will be discussing them briefly.

How is availability measured?

Organizations who plan to fully utilize a cloud infrastructure must also be capable of meeting demands for 24/7 availability. Availability can be measured as the percentage of time that systems are available.

x = (n – y) * 100/n

Where n is the total number of minutes in a calendar month and y is the total number of minutes that service is unavailable in the given calendar month. High availability simply refers to a component or system that is continuously operational for a desirably long period of time. The widely-held but almost impossible to achieve standard of availability for a product or system is referred to as ‘five 9s’ (99.999 percent) availability. High availability is a requirement for any enterprise that hopes to protect their business against the risks brought about by a system outage. These risks, can lead to millions of dollars in revenue loss.

Is it really worth the money?

The fact that going for high availability architecture gives you higher performance is all right, but it comes at a big cost too. You must ask yourself if you think the decision is justified from the point of view of finance.

A decision must be made on whether the extra uptime is truly worth the amount of money that has to go into it. You must ask yourself how damaging potential downtimes can be for your company and how important your services are in running your business.

How do we achieve it?

Now that you have decided to go with it, let’s discuss the ways to implement it. Non-intuitively, adding more components to a system doesn’t help in making it more stable and achieve high availability. It can actually lead to the opposite as more components increases the probability of failures. Modern designs allow for distribution of the workloads across multiple instances such as a network or a cluster, which helps in optimizing resource use, maximizing output, minimizing response times and avoiding overburden of any system in the process known as load balancing. It also involves switching to a standby resource like a server, component or network in the case of failure of an active one, known as Failover systems.

1. Use of Multiple Application Servers:

Imagine that you have a single server to render your services and a sudden spike in traffic leads to its failure (or crashes it). In such a situation, until your server is restarted, no more requests can be served, which leads to a downtime.

The obvious solution here is to deploy your application over multiple servers. You need to distribute the load among all these, so that none of them are overburdened and the output is optimum. You can also deploy parts of your application on different servers. For instance, there could be a separate server for handling mails or a separate one for processing static files like images (like a Content Delivery Network).

2. Scaling Databases:

Databases are the most popular and perhaps one of the most conceptually simple ways to save user data. One must remember that databases are equally important to your services as your application servers. Databases run on separate servers (like the Amazon RDS) and are prone to crashes as well. What’s worse is that databases crashes can lead to a loss of user data, which can prove to be costly.

Redundancy is a process which creates systems with high levels of availability by achieving failure detectability and avoiding common cause failures. This can be achieved by maintaining slaves, which can step in if the main server crashes. Another interesting concept of scaling databases is sharding. A shard is a horizontal partition in a database, where rows of the same table  which is then run on a separate server.

3. Diversified Geographical Locations:

Scaling up your applications and then your databases is a really big step ahead, but what if all the servers are at the same physical location and something terrible like a natural disaster affects the data center at which your servers are located? This can lead to potentially huge downtimes.

It is, therefore, imperative that you keep your servers in different locations. Most modern web services allow you to select the geographical location of your servers. You should choose wisely to make sure your servers are distributed all over the world and not localized in an area.

Within this post, I have tried to touch upon the basic ideas that form the idea of high availability architecture. In the final analysis, it is evident that no single system can solve all the problems. Hence, you need to assess your situation carefully and decide on what options suit them best. Here’s hoping that this introduced you to the world of high availability architecture and helped you decide how to go about achieving this for yourself.

 

What are the best practices?

In order to curb system failures and keep both planned and unplanned downtimes at bay, the use of a High Availability (HA) architecture is highly recommended, especially for mission-critical applications. Availability experts insist that for any system to be highly available, its parts should be well designed and rigorously tested. The design and subsequent implementation of a high availability architecture can be difficult given the vast range of software, hardware and deployment options. However, a successful effort typically starts with distinctly defined and comprehensively understood business requirements. The chosen architecture should be able to meet the desired levels of security, scalability, performance and availability.

The only way to guarantee compute environments have a desirable level of operational continuity during production hours is by designing them with high availability. In addition to properly designing the architecture, enterprises can keep crucial applications online by observing the recommended best practices for high availability.

  1. Data Backups, Recovery and Replication

The hallmark of a good data protection plan that protects against system failure is a sound backup and recovery strategy. Valuable data should never be stored without proper backups, replication or the ability to recreate the data. Every data center should plan for data loss or corruption in advance. Data errors may create customer authentication issues, damage financial accounts and subsequently business community credibility. The recommended strategy for maintaining data integrity is creating a full backup of the primary database then incrementally testing the source server for data corruptions. Creating full backups is at the forefront of recovering from catastrophic system failure.

  1. Clustering

Even with the highest quality of software engineering, all application services are bound to fail at some point. High availability is all about delivering application services regardless of failures. Clustering can provide instant failover application services in the event of a fault. An application service that is ‘cluster aware’ is capable of calling resources from multiple servers; it falls back to a secondary server if the main server goes offline. A High Availability cluster includes multiple nodes that share information via shared data memory grids. This means that any node can be disconnected or shutdown from the network and the rest of the cluster will continue to operate normally, as long as at least a single node is fully functional. Each node can be upgraded individually and rejoined while the cluster operates. The high cost of purchasing additional hardware to implement a cluster can be mitigated by setting up a virtualized cluster that utilizes the available hardware resources.

  1. Network Load Balancing

Load balancing is an effective way of increasing the availability of critical web-based applications. When server failure instances are detected, they are seamlessly replaced when the traffic is automatically redistributed to servers that are still running. Not only does load balancing lead to high availability it also facilitates incremental scalability. Network load balancing can be accomplished via either a ‘pull’ or a ‘push’ model. It facilitates higher levels of fault tolerance within service applications.

  1. Fail Over Solutions

High availability architecture traditionally consists of a set of loosely coupled servers which have failover capabilities. Failover is basically a backup operational mode in which the functions of a system component are assumed by a secondary system in the event that the primary one goes offline, either due to failure or planned down time. A ‘cold failover’ occurs when the secondary server is only started after the primary one has been completely shut down. A ‘hot failover’ occurs when all the servers are running simultaneously, and the load is directed entirely towards a single server at any given time. In both scenarios, tasks are automatically offloaded to a standby system component so that the process remains as seamless as possible to the end user. Failover can be managed via DNS, in an well-controlled environment.

  1. Geographic redundancy

Geo-redundancy is the only line of defense when it comes to preventing service failure in the face of catastrophic events such as natural disasters that cause system outages. Like in the case of geo-replication, multiple servers are deployed at geographical distinct sites. The locations should be globally distributed and not localized in a specific area. It is crucial to run independent application stacks in each of the locations, so that in case there is a failure in one location, the other can continue running. Ideally, these locations should be completely independent of each other.

  1. Plan for failure

Despite the fact that applying the best practices for high availability is essentially planning for failure; there are other actions an organization can take to increase their preparedness in the event of a system failure leading to downtime. Organizations should keep failure or resource consumption data that can be used to isolate problems and analyze trends. This data can only be gathered through continuous monitoring of operational workload. A recovery help desk can be put in place to gather problem information, establish problem history, and begin immediate problem resolutions. A recovery plan should not only be well documented but also tested regularly to ensure its practicality when dealing with unplanned interrupts. Staff training on availability engineering will improve their skills in designing, deploying, and maintaining high availability architectures. Security policies should also be put in place to curb incidences of system outages due to security breaches.

Example: FileCloud High Availability Architecture 
Following diagram explains how FileCloud servers can be configured for High Availability to improve service reliability and reduce downtime. Click here for more details.

An introduction to High Availability Architecture的更多相关文章

  1. Hadoop官方文档翻译—— YARN ResourceManager High Availability 2.7.3

    ResourceManager High Availability (RM高可用) Introduction(简介) Architecture(架构) RM Failover(RM 故障切换) Rec ...

  2. Pattern: Microservice Architecture

    Microservice Architecture pattern http://microservices.io/patterns/microservices.html Context You ar ...

  3. High Availability (HA) 和 Disaster Recovery (DR) 的区别

    High availability 和disaster recovery不是一回事. 尽管在规划和解决方案上有重叠的部分, 它们俩都是business contiunity的子集. HA的目的是在主数 ...

  4. [place recognition]NetVLAD: CNN architecture for weakly supervised place recognition 论文翻译及解析(转)

    https://blog.csdn.net/qq_32417287/article/details/80102466 abstract introduction method overview Dee ...

  5. Java性能提示(全)

    http://www.onjava.com/pub/a/onjava/2001/05/30/optimization.htmlComparing the performance of LinkedLi ...

  6. Storage Systems topics and related papers

    In this post, I will distill my own ideas and my own views into a structure for a storage system cou ...

  7. Network Load Balancing Technical Overview--reference

    http://technet.microsoft.com/en-us/library/bb742455.aspx Abstract Network Load Balancing, a clusteri ...

  8. SAP HANA学习资料大全[非常完善的学习资料汇总]

    Check out this SDN blog if you plan to write HANA Certification exam http://scn.sap.com/community/ha ...

  9. (转)Awesome Courses

    Awesome Courses  Introduction There is a lot of hidden treasure lying within university pages scatte ...

随机推荐

  1. Centos安装Oracle数据库文本记录

    题记,本文旨在记录图形化安装过程,的过程...仅仅是回忆性学习... oracle账号登陆图形界面    #没有图形化,图形检查不通过 运行终端 Terminal cd /u01/database . ...

  2. Ubuntu18.04中配置QT5.11开发环境

    准备工作 参考 https://wiki.qt.io/Install_Qt_5_on_Ubuntu . # 安装g++ sudo apt install build-essential # sudo ...

  3. 有关windows Gateway Ipsec 和NAT 兼容性问题

    1.简单通信拓扑: 将Windows 平台 作为一个网关,同一时候开启IPsec 和NAT来支持private和public的通信. 注意:IPSEC Gateway  和 Client1 Ipsec ...

  4. k8s oomkilled超出容器的内存限制

    超出容器的内存限制 只要节点有足够的内存资源,那容器就可以使用超过其申请的内存,但是不允许容器使用超过其限制的 资源.如果容器分配了超过限制的内存,这个容器将会被优先结束.如果容器持续使用超过限制的内 ...

  5. LUA可变长参数 ... 三个点

    本文翻译自 LUA官方文档 When a function is called, the list of arguments is adjusted to the length of the list ...

  6. keras embeding设置初始值的两种方式

    随机初始化Embedding from keras.models import Sequential from keras.layers import Embedding import numpy a ...

  7. C语言中的随意跳转

    C语言中有一个很不常用的头文件:setjmp.h. 这个头文件是C语言底层实现的,不像math.h里面的函数都是纯C语言实现的. setjmp.h包含两个函数: longjmp 跳转到某个位置 set ...

  8. Debug 路漫漫-05

    Debug 路漫漫-05: 1.使用这种方式计算 AUC 指标,结果出来居然是 NAN, —— 分母为(M*N),M或者N必有一个为0 了.(nan出现的情况绝大部分是分母出现0了)   若分子为0的 ...

  9. Yslow---一款很实用的web性能测试插件

    YSlow可以对网站的页面进行分析,并告诉你为了提高网站性能,如何基于某些规则而进行优化. YSlow可以分析任何网站,并为每一个规则产生一个整体报告,如果页面可以进行优化,则YSlow会列出具体的修 ...

  10. C语言学习笔记 (002) - C++中引用和指针的区别(转载)

    下面用通俗易懂的话来概述一下: 指针-对于一个类型T,T*就是指向T的指针类型,也即一个T*类型的变量能够保存一个T对象的地址,而类型T是可以加一些限定词的,如const.volatile等等.见下图 ...