Designing fault-tolerant systems is extremely difficult. You can try to anticipate everything that can go wrong with your software and code defensively for each situation, but in a complex system it is very likely that some combination of events or inputs will eventually conspire against you to cause a failure.

In certain areas of the software community, such as the Erlang and Akka communities, there’s a philosophy that rather than trying to handle and recover from every possible exceptional or failure state, you should simply fail early and let your processes crash, then recycle them back into the pool to serve the next request. This gives the system a kind of self-healing property, where it recovers from failure without ceremony, whilst freeing the developer from overly defensive error handling.
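As a rough illustration of the idea, here is a minimal sketch in Python rather than Erlang or Akka. The names `Supervisor`, `make_worker` and `max_restarts` are made up for this example and do not come from any library; real supervisors in Erlang/OTP or Akka are considerably richer. The worker does no defensive checking at all, and the supervisor simply replaces it with a fresh one when it crashes.

```python
class Supervisor:
    """Replaces a crashed worker with a fresh one (illustrative only)."""

    def __init__(self, worker_factory, max_restarts=5):
        self.worker_factory = worker_factory
        self.max_restarts = max_restarts  # assumed limit, to avoid restart loops
        self.restarts = 0
        self.worker = worker_factory()

    def handle(self, request):
        try:
            return self.worker(request)
        except Exception as exc:
            # Let it crash: throw the worker away and start from clean state.
            self.restarts += 1
            if self.restarts > self.max_restarts:
                raise RuntimeError("restart limit exceeded") from exc
            self.worker = self.worker_factory()
            return None  # the failed request is dropped; the caller may retry


def make_worker():
    """Hypothetical worker with private state and no defensive checks."""
    state = {"handled": 0}

    def worker(request):
        state["handled"] += 1
        return 100 // request  # raises ZeroDivisionError when request == 0

    return worker


supervisor = Supervisor(make_worker)
print(supervisor.handle(5))  # handled normally
print(supervisor.handle(0))  # worker crashes and is replaced
print(supervisor.handle(4))  # the fresh worker serves the next request
```

The point of the sketch is where the error handling lives: in one place, around the worker, rather than scattered through the worker's own logic.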

I believe that implementing let-it-crash semantics and working within this mindset will improve almost any application, not just the real-time telecoms systems where Erlang was born. When you adopt let it crash, redundancy and defence against errors are baked into the architecture, rather than left to defensive code that tries to anticipate every scenario right down in the guts of the system. It also encourages you to build more redundancy into every layer of your system.

Also ask yourself: if the components or services in your application did crash, how well would your system recover, with or without human intervention? Very few applications can recover fully automatically, and yet implementing this feels like relatively low-hanging fruit compared to writing 100% fault-tolerant code.

So how do we start to put this into practice?

At the hardware level, you can look towards the ‘Google model’ of commodity servers, whereby the failure of any given server supporting the system does not lead to a fatal degradation of service. This is easier in the cloud world, where the economics encourage us to use a larger number of small virtualised servers. Just let them crash, and design for the fact that servers can die at a moment’s notice.

Your application might be composed of different logical services, such as a user authentication service or a shopping cart system. Design the system to let entire services crash. Where appropriate, your application should be able to proceed and degrade gracefully whilst a service is unavailable, or fall back onto another instance of the service whilst the first one is recycling. No single service should sit on the critical code path, because it might crash!
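A minimal Python sketch of that fallback behaviour might look like the following. Everything here is hypothetical: `call_with_fallback`, `primary_cart` and `secondary_cart` are invented stand-ins for real service clients, and a real system would also want timeouts and logging.

```python
def call_with_fallback(instances, request, default):
    """Try each service instance in turn; degrade to `default` if all fail."""
    for instance in instances:
        try:
            return instance(request)
        except Exception:
            # This instance has crashed or is recycling; try the next one.
            continue
    return default


def primary_cart(user_id):
    """Hypothetical first instance, currently restarting."""
    raise ConnectionError("cart service is recycling")


def secondary_cart(user_id):
    """Hypothetical second instance of the same service."""
    return ["book", "pen"]


# Falls back onto the second instance whilst the first one is recycling.
print(call_with_fallback([primary_cart, secondary_cart], "user-1", []))

# With no instance available, the page degrades to an empty cart.
print(call_with_fallback([primary_cart], "user-1", []))
```

The `default` value is the graceful degradation: the caller carries on with something sensible instead of failing the whole request.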

Ideally, your distributed system will be organised to scale horizontally across different server nodes. The system should load balance or intelligently route between processes in the pool, and nodes should be able to join or leave the pool without too much ceremony or impact on the application. When you have this style of horizontal scalability, let nodes within your application crash and rejoin the pool when they’re ready.
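To make the pool idea concrete, here is a small sketch of round-robin routing over a set of nodes that can come and go. `NodePool` and its methods are names I have made up for illustration; in practice this job is usually done by a load balancer or service registry rather than hand-written code.

```python
class NodePool:
    """Round-robin router over whichever nodes are currently in the pool."""

    def __init__(self):
        self._nodes = []
        self._next = 0

    def join(self, name):
        # A new or recovered node announces that it is ready for traffic.
        if name not in self._nodes:
            self._nodes.append(name)

    def leave(self, name):
        # Called when a node crashes or fails a health check.
        if name in self._nodes:
            self._nodes.remove(name)

    def route(self):
        if not self._nodes:
            raise RuntimeError("no healthy nodes in the pool")
        node = self._nodes[self._next % len(self._nodes)]
        self._next += 1
        return node


pool = NodePool()
pool.join("node-a")
pool.join("node-b")
pool.leave("node-a")   # node-a crashes
print(pool.route())    # traffic keeps flowing to node-b
pool.join("node-a")    # node-a rejoins when it is ready
```

Joining and leaving are single cheap operations, which is what makes it acceptable to let a node crash rather than nurse it along.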

What if we go further and implement let-it-crash semantics for our infrastructure?

For instance, say we have a messaging system or message broker that transports messages between the components of our application. What if we let that crash and come back online later? Could we design the application so that this is not as fatal as it sounds, perhaps by allowing application components to write to, or dynamically switch between, two message brokers?
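One way the "dynamically switch between two brokers" option could look is sketched below. `FailoverPublisher` and `InMemoryBroker` are hypothetical classes written for this example; they do not correspond to the client API of any real broker, and the sketch ignores harder questions such as message ordering and duplicates.

```python
class InMemoryBroker:
    """Stand-in for a real broker client; `up` simulates a crash."""

    def __init__(self):
        self.up = True
        self.messages = []

    def send(self, message):
        if not self.up:
            raise ConnectionError("broker is down")
        self.messages.append(message)


class FailoverPublisher:
    """Publishes to the active broker and switches when it is unavailable."""

    def __init__(self, brokers):
        self.brokers = list(brokers)
        self.active = 0

    def publish(self, message):
        # Try each broker at most once, starting with the active one.
        for _ in range(len(self.brokers)):
            try:
                self.brokers[self.active].send(message)
                return self.active
            except ConnectionError:
                self.active = (self.active + 1) % len(self.brokers)
        raise ConnectionError("all brokers are down")


broker_a, broker_b = InMemoryBroker(), InMemoryBroker()
publisher = FailoverPublisher([broker_a, broker_b])
publisher.publish("order created")
broker_a.up = False                  # the first broker crashes
publisher.publish("order shipped")   # quietly lands on the second broker
```

Consumers would need to read from both brokers for this to work end to end, which is an architectural decision rather than something the publisher can solve alone.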

Distributed NoSQL data stores give us let-it-crash capability at the data persistence level. Data is stored across a distributed grid of nodes and replicated to at least two different hardware nodes. At this point, it’s easier to let database nodes crash than to try to achieve 100% uptime.
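The toy store below shows why replication to two nodes makes a single crash survivable. It is a sketch under heavy assumptions: `ReplicatedStore` is invented for this post, the hash-based placement is just one simple scheme, and it leaves out everything a real data store does to repair replicas that missed a write whilst they were down.

```python
import hashlib


class ReplicatedStore:
    """Toy key-value store that writes every key to `replicas` nodes."""

    def __init__(self, node_names, replicas=2):
        self.names = sorted(node_names)
        self.replicas = replicas
        self.data = {name: {} for name in self.names}
        self.down = set()  # names of nodes that are currently crashed

    def _owners(self, key):
        # Hash the key to a starting node, then take the next nodes in order.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        start = int.from_bytes(digest[:8], "big") % len(self.names)
        return [self.names[(start + i) % len(self.names)]
                for i in range(self.replicas)]

    def put(self, key, value):
        for name in self._owners(key):
            if name not in self.down:
                self.data[name][key] = value

    def get(self, key):
        # Any live replica that holds the key can answer the read.
        for name in self._owners(key):
            if name not in self.down and key in self.data[name]:
                return self.data[name][key]
        raise KeyError(key)


store = ReplicatedStore(["n1", "n2", "n3"])
store.put("cart:42", ["book"])
first_owner = store._owners("cart:42")[0]
store.down.add(first_owner)   # one replica crashes
print(store.get("cart:42"))   # the other replica still serves the read
```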

At the network level, we can design topologies such that we do not care if routers or network links crash, because there’s always some alternate route through the network. Let them crash, and when they come back the optimal routes will be ready for our application to use again.
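As a simplified picture of what "some alternate route" means, the sketch below runs a breadth-first search over a small made-up topology and skips any routers marked as failed. Real networks do this with routing protocols rather than application code; `find_route` and the `LINKS` table are purely illustrative.

```python
from collections import deque


def find_route(links, source, target, failed=frozenset()):
    """Breadth-first search for a shortest path that avoids failed routers."""
    if source in failed or target in failed:
        return None
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for neighbour in links.get(node, ()):
            if neighbour not in seen and neighbour not in failed:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None  # no route survives the current failures


# Hypothetical topology with two independent paths from app to db.
LINKS = {
    "app": ["r1", "r2"],
    "r1": ["app", "r3"],
    "r2": ["app", "r4"],
    "r3": ["r1", "db"],
    "r4": ["r2", "db"],
    "db": ["r3", "r4"],
}

print(find_route(LINKS, "app", "db"))                 # normal route
print(find_route(LINKS, "app", "db", failed={"r1"}))  # route around r1
```

The topology only tolerates a crash because it was designed with two disjoint paths; if `r1` and `r4` fail together, `find_route` returns `None`.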

Let it crash is more than simple redundancy. It’s about making the application able to recover by itself. It’s about putting your site reliability efforts into your architecture rather than into low-level defensive coding. It’s about decoupling your application and introducing asynchronicity, in recognition that things go wrong in surprising ways. Ironically, sitting back and coolly letting your software crash can lead to better software!
