PEP link:

https://peps.python.org/pep-0703/

PEP 703 – Making the Global Interpreter Lock Optional in CPython

================================================

Abstract

CPython’s global interpreter lock (“GIL”) prevents multiple threads from executing Python code at the same time. The GIL is an obstacle to using multi-core CPUs from Python efficiently. This PEP proposes adding a build configuration (--disable-gil) to CPython to let it run Python code without the global interpreter lock and with the necessary changes needed to make the interpreter thread-safe.
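As a rough illustration of what an optional GIL means for user code, the sketch below reports whether the running interpreter is a free-threaded build and whether the GIL is currently active. It assumes CPython 3.13 or later, where `sys._is_gil_enabled()` and the `Py_GIL_DISABLED` build variable are available; neither is named in the text above. The helper name `gil_status` is hypothetical, and on older interpreters it simply falls back to reporting that the GIL is on.

```python
import sys
import sysconfig


def gil_status():
    """Return (free_threaded_build, gil_enabled) for the running interpreter."""
    # Py_GIL_DISABLED is 1 on free-threaded builds; 0 or None otherwise.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() is assumed to exist only on CPython 3.13+;
    # if it is missing, assume a traditional build with the GIL on.
    check = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = check() if check is not None else True
    return free_threaded_build, gil_enabled


if __name__ == "__main__":
    build, enabled = gil_status()
    print(f"free-threaded build: {build}, GIL enabled: {enabled}")
```

The fallback keeps the sketch usable on interpreters that predate the free-threaded build, at the cost of not distinguishing "GIL on" from "cannot tell".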

Motivation

The GIL is a major obstacle to concurrency. For scientific computing tasks, this lack of concurrency is often a bigger issue than speed of executing Python code, since most of the processor cycles are spent in optimized CPU or GPU kernels. The GIL introduces a global bottleneck that can prevent other threads from making progress if they call any Python code. There are existing ways to enable parallelism in CPython today, but those techniques come with significant limitations (see Alternatives).

This section focuses on the GIL’s impact on scientific computing, particularly AI/ML workloads, because that is the area with which this author has the most experience, but the GIL also affects other users of Python.

The GIL Makes Many Types of Parallelism Difficult to Express

Neural network-based AI models expose multiple opportunities for parallelism. For example, individual operations may be parallelized internally (“intra-operator”), multiple operations may be executed simultaneously (“inter-operator”), and requests (spanning multiple operations) may also be parallelized. Efficient execution requires exploiting multiple types of parallelism [1].

The GIL makes it difficult to express inter-operator parallelism, as well as some forms of request parallelism, efficiently in Python. In other programming languages, a system might use threads to run different parts of a neural network on separate CPU cores, but this is inefficient in Python due to the GIL. Similarly, latency-sensitive inference workloads frequently use threads to parallelize across requests, but face the same scaling bottlenecks in Python.
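The scaling problem described above can be observed with a small experiment. The sketch below is only an illustration, not a benchmark from the PEP: it runs the same CPU-bound pure-Python function first serially and then on a thread pool. The function `count_primes`, the workload size, and the worker count are all arbitrary choices made for this example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def count_primes(limit):
    """CPU-bound pure-Python work: count primes below limit by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def run_serial(limits):
    """Process every workload one after another on the calling thread."""
    return [count_primes(limit) for limit in limits]


def run_threaded(limits, workers=4):
    """Process the workloads on a thread pool.

    With the GIL, only one thread executes Python bytecode at a time,
    so these workers take turns instead of running in parallel.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(count_primes, limits))


if __name__ == "__main__":
    limits = [20_000] * 4
    for name, fn in (("serial", run_serial), ("threaded", run_threaded)):
        start = time.perf_counter()
        result = fn(limits)
        print(f"{name}: {time.perf_counter() - start:.2f}s {result}")
```

On a traditional CPython build the two timings are expected to be roughly equal, since the threads cannot run Python code simultaneously. On a build without the GIL the threaded run may finish faster, depending on the number of available cores.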

The challenges the GIL poses to exploiting parallelism in Python frequently come up in reinforcement learning. Heinrich Kuttler, author of the NetHack Learning Environment and Member of Technical Staff at Inflection AI, writes:

Recent breakthroughs in reinforcement learning, such as on Dota 2, StarCraft, and NetHack rely on running multiple environments (simulated games) in parallel using asynchronous actor-critic methods. Straightforward multithreaded implementations in Python don’t scale beyond more than a few parallel environments due to GIL contention. Multiprocessing, with communication via shared memory or UNIX sockets, adds much complexity and in effect rules out interacting with CUDA from different workers, severely restricting the design space.

Manuel Kroiss, software engineer at DeepMind on the reinforcement learning team, describes how the bottlenecks posed by the GIL lead to rewriting Python codebases in C++, making the code less accessible:

We frequently battle issues with the Python GIL at DeepMind. In many of our applications, we would like to run on the order of 50-100 threads per process. However, we often see that even with fewer than 10 threads the GIL becomes the bottleneck. To work around this problem, we sometimes use subprocesses, but in many cases the inter-process communication becomes too big of an overhead. To deal with the GIL, we usually end up translating large parts of our Python codebase into C++. This is undesirable because it makes the code less accessible to researchers.

Projects that involve interfacing with multiple hardware devices face similar challenges: efficient communication requires use of multiple CPU cores. The Dose-3D project aims to improve cancer radiotherapy with precise dose planning. It uses medical phantoms (stand-ins for human tissue) together with custom hardware and a server application written in Python. Paweł Jurgielewicz, lead software architect for the data acquisition system on the Dose-3D project, describes the scaling challenges posed by the GIL and how using a fork of Python without the GIL simplified the project:

In the Dose-3D project, the key challenge was to maintain a stable, non-trivial concurrent communication link with hardware units while utilizing a 1 Gbit/s UDP/IP connection to the maximum. Naturally, we started with the multiprocessing package, but at some point, it became clear that most CPU time was consumed by the data transfers between the data processing stages, not by data processing itself. The CPython multithreading implementation based on GIL was a dead end too. When we found out about the “nogil” fork of Python it took a single person less than half a working day to adjust the codebase to use this fork and the results were astonishing. Now we can focus on data acquisition system development rather than fine-tuning data exchange algorithms.

Allen Goodman, author of CellProfiler and staff engineer at Prescient Design and Genentech, describes how the GIL makes biological methods research more difficult in Python:

Issues with Python’s global interpreter lock are a frequent source of frustration throughout biological methods research.

I wanted to better understand the current multithreading situation so I reimplemented parts of HMMER, a standard method for multiple-sequence alignment. I chose this method because it stresses both single-thread performance (scoring) and multi-threaded performance (searching a database of sequences). The GIL became the bottleneck when using only eight threads. This is a method where the current popular implementations rely on 64 or even 128 threads per process. I tried moving to subprocesses but was blocked by the prohibitive IPC costs. HMMER is a relatively elementary bioinformatics method and newer methods have far bigger multi-threading demands.

Method researchers are begging to use Python (myself included), because of its ease of use, the Python ecosystem, and because “it’s what people know.” Many biologists only know a little bit of programming (and that’s almost always Python). Until Python’s multithreading situation is addressed, C and C++ will remain the lingua franca of the biological methods research community.

The GIL Affects Python Library Usability

The GIL is a CPython implementation detail that limits multithreaded parallelism, so it might seem unintuitive to think of it as a usability issue. However, library authors frequently care a great deal about performance and will design APIs that support working around the GIL. These workarounds frequently lead to APIs that are more difficult to use. Consequently, users of these APIs may experience the GIL as a usability issue and not just a performance issue.

For example, PyTorch exposes a multiprocessing-based API called DataLoader for building data input pipelines. It uses fork() on Linux because it is generally faster and uses less memory than spawn(), but this leads to additional challenges for users: creating a DataLoader after accessing a GPU can lead to confusing CUDA errors. Accessing GPUs within a DataLoader worker quickly leads to out-of-memory errors because processes do not share CUDA contexts (unlike threads within a process).

Olivier Grisel, scikit-learn developer and software engineer at Inria, describes how having to work around the GIL in scikit-learn related libraries leads to a more complex and confusing user experience:

Over the years, scikit-learn developers have maintained ancillary libraries such as joblib and loky to try to work around some of the limitations of multiprocessing: extra memory usage partially mitigated via semi-automated memory mapping of large data buffers, slow worker startup by transparently reusing a pool of long running workers, fork-safety problems of third-party native runtime libraries such as GNU OpenMP by never using the fork-only start-method, ability to perform parallel calls of interactively defined functions in notebooks and REPLs in cross-platform manner via cloudpickle. Despite our efforts, this multiprocessing-based solution is still brittle, complex to maintain and confusing to datascientists with limited understanding of system-level constraints. Furthermore, there are still irreducible limitations such as the overhead caused by the pickle-based serialization/deserialization steps required for inter-process communication. A lot of this extra work and complexity would not be needed anymore if we could use threads without contention on multicore hosts (sometimes with 64 physical cores or more) to run data science pipelines that alternate between Python-level operations and calls to native libraries.
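The serialization overhead mentioned in the quote above can be made concrete with a small sketch. It does not start real worker processes; it only imitates the pickle round trip that a multiprocessing-based pipeline performs when passing an object to a worker, and contrasts it with a thread hand-off, which passes a reference. The function names are hypothetical and chosen for this example.

```python
import pickle


def send_between_processes(obj):
    """Imitate the copy that inter-process communication makes of an object."""
    payload = pickle.dumps(obj)  # serialize, as the sending process would
    # Deserialize, as the receiving process would; this builds a new object.
    return pickle.loads(payload), len(payload)


def send_between_threads(obj):
    """Threads share one address space, so handing over a reference suffices."""
    return obj


if __name__ == "__main__":
    data = list(range(100_000))
    copy, nbytes = send_between_processes(data)
    shared = send_between_threads(data)
    print(f"pickled payload: {nbytes} bytes, copy is original: {copy is data}")
    print(f"thread hand-off is original: {shared is data}")
```

The round trip costs time and memory proportional to the size of the data, whereas the thread hand-off costs neither. This is the part of the overhead that cannot be engineered away while separate processes are required.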

Ralf Gommers, co-director of Quansight Labs and NumPy and SciPy maintainer, describes how the GIL affects the user experience of NumPy and numeric Python libraries:

A key problem in NumPy and the stack of packages built around it is that NumPy is still (mostly) single-threaded — and that has shaped significant parts of the user experience and projects built around it. NumPy does release the GIL in its inner loops (which do the heavy lifting), but that is not nearly enough. NumPy doesn’t offer a solution to utilize all CPU cores of a single machine well, and instead leaves that to Dask and other multiprocessing solutions. Those aren’t very efficient and are also more clumsy to use. That clumsiness comes mainly in the extra abstractions and layers the users need to concern themselves with when using, e.g., dask.array which wraps numpy.ndarray. It also shows up in oversubscription issues that the user must explicitly be aware of and manage via either environment variables or a third package, threadpoolctl. The main reason is that NumPy calls into BLAS for linear algebra - and those calls it has no control over, they do use all cores by default via either pthreads or OpenMP.

Coordinating on APIs and design decisions to control parallelism is still a major amount of work, and one of the harder challenges across the PyData ecosystem. It would have looked a lot different (better, easier) without a GIL.

GPU-Heavy Workloads Require Multi-Core Processing

Many high-performance computing (HPC) and AI workloads make heavy use of GPUs. These applications frequently require efficient multi-core CPU execution even though the bulk of the computation runs on a GPU.

Zachary DeVito, PyTorch core developer and researcher at FAIR (Meta AI), describes how the GIL makes multithreaded scaling inefficient even when the bulk of computation is performed outside of Python:

In PyTorch, Python is commonly used to orchestrate ~8 GPUs and ~64 CPU threads, growing to 4k GPUs and 32k CPU threads for big models. While the heavy lifting is done outside of Python, the speed of GPUs makes even just the orchestration in Python not scalable. We often end up with 72 processes in place of one because of the GIL. Logging, debugging, and performance tuning are orders-of-magnitude more difficult in this regime, continuously causing lower developer productivity.

The use of many processes (instead of threads) makes common tasks more difficult. Zachary DeVito continues:

On three separate occasions in the past couple of months (reducing redundant compute in data loaders, writing model checkpoints asynchronously, and parallelizing compiler optimizations), I spent an order-of-magnitude more time figuring out how to work around GIL limitations than actually solving the particular problem.

Even GPU-heavy workloads frequently have a CPU-intensive component. For example, computer vision tasks typically require multiple “pre-processing” steps in the data input pipeline, like image decoding, cropping, and resizing. These tasks are commonly performed on the CPU and may use Python libraries like Pillow or Pillow-SIMD. It is necessary to run the data input pipeline on multiple CPU cores in order to keep the GPU “fed” with data.

The increase in GPU performance compared to individual CPU cores makes multi-core performance more important. It is progressively more difficult to keep the GPUs fully occupied. To do so requires efficient use of multiple CPU cores, especially on multi-GPU systems. For example, NVIDIA’s DGX-A100 has 8 GPUs and two 64-core CPUs in order to keep the GPUs “fed” with data.

The GIL Makes Deploying Python AI Models Difficult

Python is widely used to develop neural network-based AI models. In PyTorch, models are frequently deployed as part of multi-threaded, mostly C++, environments. Python is often viewed skeptically because the GIL can be a global bottleneck, preventing efficient scaling even though the vast majority of the computations occur “outside” of Python with the GIL released. The torchdeploy paper [2] shows experimental evidence for these scaling bottlenecks in multiple model architectures.

PyTorch provides a number of mechanisms for deploying Python AI models that avoid or work around the GIL, but they all come with substantial limitations. For example, TorchScript captures a representation of the model that can be executed from C++ without any Python dependencies, but it only supports a limited subset of Python and often requires rewriting some of the model’s code. The torch::deploy API allows multiple Python interpreters, each with its own GIL, in the same process (similar to PEP 684). However, torch::deploy has limited support for Python modules that use C-API extensions.

Motivation Summary

Python’s global interpreter lock makes it difficult to use modern multi-core CPUs efficiently for many scientific and numeric computing applications. Heinrich Kuttler, Manuel Kroiss, and Paweł Jurgielewicz found that multi-threaded implementations in Python did not scale well for their tasks and that using multiple processes was not a suitable alternative.

The scaling bottlenecks are not solely in core numeric tasks. Both Zachary DeVito and Paweł Jurgielewicz described challenges with coordination and communication in Python.

Olivier Grisel, Ralf Gommers, and Zachary DeVito described how current workarounds for the GIL are “complex to maintain” and cause “lower developer productivity.” The GIL makes it more difficult to develop and maintain scientific and numeric computing libraries, as well as leading to library designs that are more difficult to use.

The rest of the PEP is omitted here.

================================================

Related:

https://docs.google.com/document/d/18CXhDb1ygxg-YXNBJNzfzZsDFosB5e6BfnXLlejd9l0/edit?pli=1#heading=h.jcxfoklnvp0i
