This past weekend I read Joe Armstrong’s paper on the history of Erlang. Now, HOPL papers in general are like candy for me, and this one did not disappoint. There’s more in this paper that I can cover in one post, so today I’m going to concentrate on one particular feature of Erlang highlighted by Armstrong.

Although Erlang is designed to encourage/facilitate a massively parallel programming style, its error handling may be even more noteworthy. Like everything else in Erlang, its error handling is designed to be distributed, and for good reason:

Error handling in Erlang is very different from error handling in conventional programming languages. The key observation here is to note that the error-handling mechanisms were designed for building fault-tolerant systems, and not merely for protecting from program exceptions. You cannot build a fault-tolerant system if you only have one computer. The minimal configuration for a fault tolerant system has two computers. These must be configured so that both observe each other. If one of the computers crashes, then the other computer must take over whatever the first computer was doing.

This means that the model for error handling is based on the idea of two computers that observe each other.

Erlang is famous for its features which help programmers to produce stable systems in the real world. Its shared nothing architecture and ability to hot-swap code are well-known. But these features are available in other systems. The "links" feature, on the other hand, seems to be unique. When you create a process in Erlang, you can link it to another process; this link essentially means, "If that process crashes, I’d like to crash, also; and if I crash, that process should die, too." Here is Armstrong’s description:

Links in Erlang are provided to control error propagation paths for errors between processes. An Erlang process will die if it evaluates illegal code, so, for example, if a process tries to divide by zero it will die. The basic model of error handling is to assume that some other process in the system will observe the death of the process and take appropriate corrective actions. But which process in the system should do this? If there are several thousand processes in the system then how do we know which process to inform when an error occurs? The answer is the linked process. If some process A evaluates the primitive link(B) then it becomes linked to A . If A dies then B is informed. If B dies then A is informed.

Using links, we can create sets of processes that are linked together. If these are normal processes, they will die immediately if they are linked to a process that dies with an error. The idea here is to create sets of processes such that if any process in the set dies, then they will all die. This mechanism provides the invariant that either all the processes in the set are alive or none of them are. This is very useful for programming error-recovery strategies in complex situations. As far as I know, no other programming language has anything remotely like this.

In addition to simply killing the linked process, the link can also function as a kind of signal to a system process that a group of processes have died, so that appropriate action can be taken, such as restarting a process group.

Like Armstrong, I cannot think of another system that works quite this way. The closest analogy I can think of is a distributed transaction. But distributed transactions have quite a bit more overhead, because they’re all about providing serializable access to shared data, which Erlang just doesn’t allow.

Armstrong says that the idea of links was inspired by the ‘C wire’ in early telephone exchanges:

The C wire went back to the exchange and through all the electromechanical relays involved in setting up a call. If anything went wrong, or if either partner terminated the call, then the C wire was grounded. Grounding the C wire caused a knock-on effect in the exchange that freed all resources connected to the C line.

Armstrong says that the links feature encourages a worker/supervisor style of programming which is "not possible in a single threaded language."

Let it crash philosophy for distributed systems的更多相关文章

  1. Let it crash philosophy part II

    Designing fault tolerant systems is extremely difficult.  You can try to anticipate and reason about ...

  2. Distributed systems theory for the distributed systems engineer

    Gwen Shapira, SA superstar and now full-time engineer at Cloudera, asked a question on Twitter that ...

  3. 可扩展的Web系统和分布式系统(Scalable Web Architecture and Distributed Systems)

    Open source software has become a fundamental building block for some of the biggest websites. And a ...

  4. Scalable Web Architecture and Distributed Systems

    转自:http://aosabook.org/en/distsys.html Scalable Web Architecture and Distributed Systems Kate Matsud ...

  5. Scalable, Distributed Systems Using Akka, Spring Boot, DDD, and Java--转

    原文地址:https://dzone.com/articles/scalable-distributed-systems-using-akka-spring-boot-ddd-and-java Whe ...

  6. [翻译] TensorFlow 分布式之论文篇 "TensorFlow : Large-Scale Machine Learning on Heterogeneous Distributed Systems"

    [翻译] TensorFlow 分布式之论文篇 "TensorFlow : Large-Scale Machine Learning on Heterogeneous Distributed ...

  7. [分布式系统学习]阅读笔记 Distributed systems for fun and profit 抽象 之二

    本文是阅读 http://book.mixu.net/distsys/abstractions.html 的笔记. 第二章的题目是"Up and down the level of abst ...

  8. [分布式系统学习]阅读笔记 Distributed systems for fun and profit 之一 基本概念

    因为工作的原因,最近打算看一些分布式学习的资料.其中这个http://book.mixu.net/distsys/就是一篇非常适合分布式入门的介绍. 这个短小的材料有下面5个小的章节,图文并茂,也没有 ...

  9. Fault-Tolerance, Fast and Slow: Exploiting Failure Asynchrony in Distributed Systems

    本文(OSDI 18')主要介绍一种新的副本复制协议:SAUCR(场景可感知的更新与故障恢复).它是一种混合的协议: 在一定场景(正常情况)下:副本复制的数据缓存在内存中. 故障发生时(多个节点挂掉, ...

随机推荐

  1. 用jquery操作字体颜色覆盖当前页面的css设置

    直接使用css操作color的时候,!important一直不生效,记录下,使用下面的可以起作用 用jquery操作字体颜色覆盖当前页面的css设置 $('a[href="?p=home&a ...

  2. 如何用git命令行上传本地代码到github

    注意:安装的前提条件是配置好Git的相关环境或者安装好git.exe,此处不再重点提及 上传的步骤: 本文采用git 命令界面进行操作,先执行以下两个命令,配置用户名和email[设置用戶名和e-ma ...

  3. 迷你MVVM框架 avalonjs 1.2发布

    avalon1.2 带来了许多新特性,让开发更轻松!详见如下: 升级路由系统与分页组件. 对ms-duplex的绑定值进行增强,以前只能prop或prop.prop2,现在可以prop["x ...

  4. Spring-JDBC模板-事务

    Spring-JDBC模板-事务 1.事务概述 什么是事务 逻辑上的一组操作,组成这组操作的各个单元要么全部成功要么全部失败 事务的特点ACID 原子性:事务不可分割(事务要么成功,要么失败) 一致性 ...

  5. Python3 sorted() 函数

    Python3 sorted() 函数  Python3 内置函数 描述 sorted() 函数对所有可迭代的对象进行排序操作. sort 与 sorted 区别: sort 是应用在 list 上的 ...

  6. jQuery插件Highcharts

    Highcharts 是一个用纯 JavaScript 编写的一个图表库, 能够很简单便捷的在 Web 网站或是 Web 应用程序添加有交互性的图表,并且免费提供给个人学习.个人网站和非商业用途使用. ...

  7. maven标签说明

    <project xmlns="http://maven.apache.org/POM/4.0.0 " xmlns:xsi="http://www.w3.org/2 ...

  8. Error while trying to retrieve text for error ORA-12154

    问题描述:vs中调试运行没有任何错误,但是发布到IIS中访问,就会报以上错误.IIS不会调试,所以一头雾水,不止错误在哪里. 分析:看到网上有人分析了Web.config模拟验证的问题恍然大悟: 原文 ...

  9. Sprint boot notes

    1. spring.io 官网 2. http://javainuse.com/spring/sprboot spring boot学习资源 3. spring boot websocketss视频  ...

  10. 安装bcmath 扩展

    1.在php源码包中,默认就包含bcmath扩展的安装文件,只需手动安装一下即可 cd /root/build2/php-/ext/bcmath // 进入PHP的源码包目录中的bcmatch扩展目录 ...