https://docs.google.com/document/d/1TTj4T2JO42uD5ID9e89oa0sLKhJYD0Y_kqxDv3I3XMw/

Scalable Go Scheduler Design Doc

Dmitry Vyukov

dvyukov@google.com

May 2, 2012

The document assumes some prior knowledge of the Go language and current goroutine scheduler implementation.

Problems with current scheduler

The current goroutine scheduler limits the scalability of concurrent programs written in Go, in particular high-throughput servers and parallel computational programs. The Vtocc server maxes out at 70% CPU on an 8-core box, while the profile shows 14% is spent in runtime.futex(). In general, the scheduler may inhibit users from using idiomatic fine-grained concurrency where performance is critical.

What's wrong with the current implementation:

1. Single global mutex (Sched.Lock) and centralized state. The mutex protects all goroutine-related operations (creation, completion, rescheduling, etc).

2. Goroutine (G) hand-off (G.nextg). Worker threads (M's) frequently hand off runnable goroutines to each other, which may lead to increased latencies and additional overheads. Every M must be able to execute any runnable G, in particular the M that just created the G.

3. Per-M memory cache (M.mcache). The memory cache and other caches (stack alloc) are associated with all M's, while they need to be associated only with M's running Go code (an M blocked inside a syscall does not need an mcache). The ratio between M's running Go code and all M's can be as high as 1:100. This leads to excessive resource consumption (each MCache can consume up to 2M) and poor data locality.

4. Aggressive thread blocking/unblocking. In the presence of syscalls, worker threads are frequently blocked and unblocked. This adds a lot of overhead.

Design

Processors

The general idea is to introduce a notion of P (Processors) into the runtime and implement a work-stealing scheduler on top of Processors.

M represents an OS thread (as it is now). P represents a resource that is required to execute Go code. When an M executes Go code, it has an associated P. When an M is idle or in a syscall, it does not need a P.

There are exactly GOMAXPROCS P’s. All P’s are organized into an array, which is a requirement of work-stealing. Changing GOMAXPROCS involves stopping and restarting the world to resize the array of P’s.

Some variables from sched are de-centralized and moved to P. Some variables from M are moved to P (the ones that relate to active execution of Go code).

struct P
{
    Lock;
    G *gfree;             // freelist, moved from sched
    G *ghead;             // runnable, moved from sched
    G *gtail;
    MCache *mcache;       // moved from M
    FixAlloc *stackalloc; // moved from M
    uint64 ncgocall;
    GCStats gcstats;
    // etc
    ...
};

P *allp; // [GOMAXPROCS]

There is also a lock-free list of idle P’s:

P *idlep; // lock-free list

When an M is about to start executing Go code, it must pop a P from the list. When an M stops executing Go code, it pushes the P back to the list. So, when an M executes Go code, it necessarily has an associated P. This mechanism replaces sched.atomic (mcpu/mcpumax).
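The acquire/release protocol can be sketched in Go as follows. This is only an illustration under stated assumptions: it uses a mutex-protected list (which the implementation plan allows "for starters") rather than the lock-free list the design calls for, and the names idleList, put, get and the link field are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// P is a stand-in for the processor resource described above.
// Only what the idle list needs is modelled here.
type P struct {
	id   int
	link *P // next idle P (hypothetical field name)
}

// idleList is a mutex-protected stack of idle P's. The design asks for
// a lock-free list; a mutex keeps this sketch simple.
type idleList struct {
	mu   sync.Mutex
	head *P
}

// put returns p to the idle list (an M stops executing Go code).
func (l *idleList) put(p *P) {
	l.mu.Lock()
	p.link = l.head
	l.head = p
	l.mu.Unlock()
}

// get pops an idle P, or returns nil if every P is in use.
func (l *idleList) get() *P {
	l.mu.Lock()
	defer l.mu.Unlock()
	p := l.head
	if p == nil {
		return nil
	}
	l.head = p.link
	p.link = nil
	return p
}

func main() {
	const gomaxprocs = 2 // assumed value for the demo
	var idle idleList
	for i := 0; i < gomaxprocs; i++ {
		idle.put(&P{id: i})
	}
	p := idle.get() // an M acquires a P before running Go code
	fmt.Println("acquired P", p.id)
	idle.put(p) // and releases it when it stops
}
```

An M for which get returns nil cannot run Go code and has to wait until some other M releases a P.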

Scheduling

When a new G is created or an existing G becomes runnable, it is pushed onto the list of runnable goroutines of the current P. When a P finishes executing a G, it first tries to pop a G from its own list of runnable goroutines; if the list is empty, the P chooses a random victim (another P) and tries to steal half of the victim's runnable goroutines.
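A minimal sketch of this policy in Go, assuming a per-P mutex (the Lock field of struct P) and a slice as the run queue. The names push, pop, stealHalf and findRunnable are hypothetical, and stealing from the head of the victim's queue is an arbitrary choice the document does not specify.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
)

// G is a placeholder for a goroutine descriptor.
type G struct{ id int }

// P holds a local run queue guarded by its own lock.
type P struct {
	mu   sync.Mutex
	runq []*G
}

// push appends a newly runnable G to this P's local queue.
func (p *P) push(g *G) {
	p.mu.Lock()
	p.runq = append(p.runq, g)
	p.mu.Unlock()
}

// pop removes the G at the head of the local queue, or returns nil.
func (p *P) pop() *G {
	p.mu.Lock()
	defer p.mu.Unlock()
	if len(p.runq) == 0 {
		return nil
	}
	g := p.runq[0]
	p.runq = p.runq[1:]
	return g
}

// stealHalf moves half (rounded up) of victim's queue to p and
// reports how many G's were moved. The two locks are never held
// at the same time, so two P's stealing from each other cannot deadlock.
func (p *P) stealHalf(victim *P) int {
	victim.mu.Lock()
	n := (len(victim.runq) + 1) / 2
	stolen := append([]*G(nil), victim.runq[:n]...)
	victim.runq = victim.runq[n:]
	victim.mu.Unlock()

	p.mu.Lock()
	p.runq = append(p.runq, stolen...)
	p.mu.Unlock()
	return n
}

// findRunnable tries the local queue first, then one random victim.
func findRunnable(self *P, allp []*P) *G {
	if g := self.pop(); g != nil {
		return g
	}
	victim := allp[rand.Intn(len(allp))]
	if victim == self || self.stealHalf(victim) == 0 {
		return nil
	}
	return self.pop()
}

func main() {
	allp := []*P{{}, {}}
	for i := 0; i < 4; i++ {
		allp[0].push(&G{id: i})
	}
	// P1 is empty, so it has to steal; retry until the random
	// victim is P0 rather than P1 itself.
	for {
		if g := findRunnable(allp[1], allp); g != nil {
			fmt.Println("P1 runs G", g.id, "with", len(allp[1].runq), "more queued")
			break
		}
	}
}
```

Stealing half rather than a single G amortizes the cost of the steal: the thief gets enough work to stay off the victim's lock for a while.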

Syscalls/M Parking and Unparking

When an M creates a new G, it must ensure that there is another M to execute the G (if not all M’s are already busy). Similarly, when an M enters syscall, it must ensure that there is another M to execute Go code.

There are two options: we can either promptly block and unblock M’s, or employ some spinning. There is an inherent conflict between performance and burning unnecessary CPU cycles. The idea is to use spinning and accept burning some CPU cycles. However, it should not affect programs running with GOMAXPROCS=1 (command line utilities, appengine, etc).

Spinning is two-level: (1) an idle M with an associated P spins looking for new G’s, (2) an M without an associated P spins waiting for available P’s. There are at most GOMAXPROCS spinning M’s (both (1) and (2)). Idle M’s of type (1) do not block while there are idle M’s of type (2).

When a new G is spawned, or an M enters a syscall, or an M transitions from idle to busy, it ensures that there is at least 1 spinning M (or that all P’s are busy). This ensures that no runnable G is left waiting when it could be running, and at the same time avoids excessive M blocking/unblocking.
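One way to read the "at least 1 spinning M" rule is as a check on two atomic counters. The sketch below is an assumption, not the design's actual mechanism: the sched type, the nspinning and npidle counters, and the needSpinner and stopSpinning functions are all hypothetical names.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// sched holds the two counters this sketch relies on.
type sched struct {
	nspinning int32 // M's currently spinning
	npidle    int32 // P's with no M attached
}

// needSpinner is called when a G is spawned, an M enters a syscall,
// or an M goes from idle to busy. It reports whether the caller must
// start a spinning M. The compare-and-swap succeeds for only one
// caller per 0->1 transition, so concurrent events do not each wake
// an M.
func (s *sched) needSpinner() bool {
	if atomic.LoadInt32(&s.npidle) == 0 {
		return false // all P's are busy
	}
	return atomic.CompareAndSwapInt32(&s.nspinning, 0, 1)
}

// stopSpinning is called by a spinning M that found work or gave up.
func (s *sched) stopSpinning() {
	atomic.AddInt32(&s.nspinning, -1)
}

func main() {
	s := &sched{npidle: 1}
	fmt.Println(s.needSpinner()) // first event: a spinner must be started
	fmt.Println(s.needSpinner()) // second event: one is already spinning
}
```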

Spinning is mostly passive (yield to OS, sched_yield()), but may include a little bit of active spinning (a loop burning CPU); this requires investigation and tuning.

Termination/Deadlock Detection

Termination/deadlock detection is more problematic in a distributed system. The general idea is to do the checks only when all P’s are idle (tracked by a global atomic counter of idle P’s); this makes it affordable to do more expensive checks that involve aggregation of per-P state.

No details yet.

LockOSThread

This functionality is not performance-critical.

1. Locked G becomes non-runnable (Gwaiting). The M instantly returns its P to the idle list, wakes up another M and blocks.

2. Locked G becomes runnable (and reaches the head of the runq). The current M hands off its own P and the locked G to the M associated with the locked G, and unblocks it. The current M becomes idle.

Idle G

This functionality is not performance-critical.

There is a global queue of (or a single?) idle G. An M that looks for work checks the queue after several unsuccessful steal attempts.

Implementation Plan

The goal is to split the whole thing into minimal parts that can be independently reviewed and submitted.

1. Introduce the P struct (empty for now); implement allp/idlep containers (idlep is mutex-protected for starters); associate a P with each M running Go code. The global mutex and atomic state are still preserved.

2. Move G freelist to P.

3. Move mcache to P.

4. Move stackalloc to P.

5. Move ncgocall/gcstats to P.

6. Decentralize run queue, implement work-stealing. Eliminate G hand off. Still under global mutex.

7. Remove global mutex, implement distributed termination detection, LockOSThread.

8. Implement spinning instead of prompt blocking/unblocking.

The plan may turn out not to work; there are a lot of unexplored details.

Potential Further Improvements

1. Try out LIFO scheduling; this will improve locality. However, it still must provide some degree of fairness and gracefully handle yielding goroutines.

2. Do not allocate G and stack until the goroutine first runs. For a newly created goroutine we need just callerpc, fn, narg, nret and args, that is, about 6 words. This will make it possible to create a lot of running-to-completion goroutines with significantly lower memory overhead.

4. Better locality of G-to-P. Try to enqueue an unblocked G to a P on which it was last running.

5. Better locality of P-to-M. Try to execute a P on the same M it was last running on.

6. Throttling of M creation. The scheduler can be easily forced to create thousands of M's per second until the OS refuses to create more threads. M’s must be created promptly up to k*GOMAXPROCS; after that, new M’s may be added by a timer.

Random Notes

- GOMAXPROCS won’t go away as a result of this work.
