PEP地址:

https://peps.python.org/pep-0703/

PEP 703 – Making the Global Interpreter Lock Optional in CPython

================================================

Abstract

CPython’s global interpreter lock (“GIL”) prevents multiple threads from executing Python code at the same time. The GIL is an obstacle to using multi-core CPUs from Python efficiently. This PEP proposes adding a build configuration (--disable-gil) to CPython to let it run Python code without the global interpreter lock and with the necessary changes needed to make the interpreter thread-safe.

Motivation

The GIL is a major obstacle to concurrency. For scientific computing tasks, this lack of concurrency is often a bigger issue than speed of executing Python code, since most of the processor cycles are spent in optimized CPU or GPU kernels. The GIL introduces a global bottleneck that can prevent other threads from making progress if they call any Python code. There are existing ways to enable parallelism in CPython today, but those techniques come with significant limitations (see Alternatives).

This section focuses on the GIL’s impact on scientific computing, particular AI/ML workloads because that is the area with which this author has the most experience, but the GIL also affects other users of Python.

The GIL Makes Many Types of Parallelism Difficult to Express

Neural network-based AI models expose multiple opportunities for parallelism. For example, individual operations may be parallelized internally (“intra-operator”), multiple operations may be executed simultaneously (“inter-operator”), and requests (spanning multiple operations) may also be parallelized. Efficient execution requires exploiting multiple types of parallelism [1].

The GIL makes it difficult to express inter-operator parallelism, as well as some forms of request parallelism, efficiently in Python. In other programming languages, a system might use threads to run different parts of a neural network on separate CPU cores, but this is inefficient in Python due to the GIL. Similarly, latency-sensitive inference workloads frequently use threads to parallelize across requests, but face the same scaling bottlenecks in Python.

The challenges the GIL poses to exploiting parallelism in Python frequently come up in reinforcement learning. Heinrich Kuttler, author of the NetHack Learning Environment and Member of Technical Staff at Inflection AI, writes:

Recent breakthroughs in reinforcement learning, such as on Dota 2, StarCraft, and NetHack rely on running multiple environments (simulated games) in parallel using asynchronous actor-critic methods. Straightforward multithreaded implementations in Python don’t scale beyond more than a few parallel environments due to GIL contention. Multiprocessing, with communication via shared memory or UNIX sockets, adds much complexity and in effect rules out interacting with CUDA from different workers, severely restricting the design space.

Manuel Kroiss, software engineer at DeepMind on the reinforcement learning team, describes how the bottlenecks posed by the GIL lead to rewriting Python codebases in C++, making the code less accessible:

We frequently battle issues with the Python GIL at DeepMind. In many of our applications, we would like to run on the order of 50-100 threads per process. However, we often see that even with fewer than 10 threads the GIL becomes the bottleneck. To work around this problem, we sometimes use subprocesses, but in many cases the inter-process communication becomes too big of an overhead. To deal with the GIL, we usually end up translating large parts of our Python codebase into C++. This is undesirable because it makes the code less accessible to researchers.

Projects that involve interfacing with multiple hardware devices face similar challenges: efficient communication requires use of multiple CPU cores. The Dose-3D project aims to improve cancer radiotherapy with precise dose planning. It uses medical phantoms (stand-ins for human tissue) together with custom hardware and a server application written in Python. Paweł Jurgielewicz, lead software architect for the data acquisition system on the Dose-3D project, describes the scaling challenges posed by the GIL and how using a fork of Python without the GIL simplified the project:

In the Dose-3D project, the key challenge was to maintain a stable, non-trivial concurrent communication link with hardware units while utilizing a 1 Gbit/s UDP/IP connection to the maximum. Naturally, we started with the multiprocessing package, but at some point, it became clear that most CPU time was consumed by the data transfers between the data processing stages, not by data processing itself. The CPython multithreading implementation based on GIL was a dead end too. When we found out about the “nogil” fork of Python it took a single person less than half a working day to adjust the codebase to use this fork and the results were astonishing. Now we can focus on data acquisition system development rather than fine-tuning data exchange algorithms.

Allen Goodman, author of CellProfiler and staff engineer at Prescient Design and Genentech, describes how the GIL makes biological methods research more difficult in Python:

Issues with Python’s global interpreter lock are a frequent source of frustration throughout biological methods research.

I wanted to better understand the current multithreading situation so I reimplemented parts of HMMER, a standard method for multiple-sequence alignment. I chose this method because it stresses both single-thread performance (scoring) and multi-threaded performance (searching a database of sequences). The GIL became the bottleneck when using only eight threads. This is a method where the current popular implementations rely on 64 or even 128 threads per process. I tried moving to subprocesses but was blocked by the prohibitive IPC costs. HMMER is a relatively elementary bioinformatics method and newer methods have far bigger multi-threading demands.

Method researchers are begging to use Python (myself included), because of its ease of use, the Python ecosystem, and because “it’s what people know.” Many biologists only know a little bit of programming (and that’s almost always Python). Until Python’s multithreading situation is addressed, C and C++ will remain the lingua franca of the biological methods research community.

The GIL Affects Python Library Usability

The GIL is a CPython implementation detail that limits multithreaded parallelism, so it might seem unintuitive to think of it as a usability issue. However, library authors frequently care a great deal about performance and will design APIs that support working around the GIL. These workaround frequently lead to APIs that are more difficult to use. Consequently, users of these APIs may experience the GIL as a usability issue and not just a performance issue.

For example, PyTorch exposes a multiprocessing-based API called DataLoader for building data input pipelines. It uses fork() on Linux because it is generally faster and uses less memory than spawn(), but this leads to additional challenges for users: creating a DataLoader after accessing a GPU can lead to confusing CUDA errors. Accessing GPUs within a DataLoader worker quickly leads to out-of-memory errors because processes do not share CUDA contexts (unlike threads within a process).

Olivier Grisel, scikit-learn developer and software engineer at Inria, describes how having to work around the GIL in scikit-learn related libraries leads to a more complex and confusing user experience:

Over the years, scikit-learn developers have maintained ancillary libraries such as joblib and loky to try to work around some of the limitations of multiprocessing: extra memory usage partially mitigated via semi-automated memory mapping of large data buffers, slow worker startup by transparently reusing a pool of long running workers, fork-safety problems of third-party native runtime libraries such as GNU OpenMP by never using the fork-only start-method, ability to perform parallel calls of interactively defined functions in notebooks and REPLs in cross-platform manner via cloudpickle. Despite our efforts, this multiprocessing-based solution is still brittle, complex to maintain and confusing to datascientists with limited understanding of system-level constraints. Furthermore, there are still irreducible limitations such as the overhead caused by the pickle-based serialization/deserialization steps required for inter-process communication. A lot of this extra work and complexity would not be needed anymore if we could use threads without contention on multicore hosts (sometimes with 64 physical cores or more) to run data science pipelines that alternate between Python-level operations and calls to native libraries.

Ralf Gommers, co-director of Quansight Labs and NumPy and SciPy maintainer, describes how the GIL affects the user experience of NumPy and numeric Python libraries:

A key problem in NumPy and the stack of packages built around it is that NumPy is still (mostly) single-threaded — and that has shaped significant parts of the user experience and projects built around it. NumPy does release the GIL in its inner loops (which do the heavy lifting), but that is not nearly enough. NumPy doesn’t offer a solution to utilize all CPU cores of a single machine well, and instead leaves that to Dask and other multiprocessing solutions. Those aren’t very efficient and are also more clumsy to use. That clumsiness comes mainly in the extra abstractions and layers the users need to concern themselves with when using, e.g., dask.array which wraps numpy.ndarray. It also shows up in oversubscription issues that the user must explicitly be aware of and manage via either environment variables or a third package, threadpoolctl. The main reason is that NumPy calls into BLAS for linear algebra - and those calls it has no control over, they do use all cores by default via either pthreads or OpenMP.

Coordinating on APIs and design decisions to control parallelism is still a major amount of work, and one of the harder challenges across the PyData ecosystem. It would have looked a lot different (better, easier) without a GIL.

GPU-Heavy Workloads Require Multi-Core Processing

Many high-performance computing (HPC) and AI workloads make heavy use of GPUs. These applications frequently require efficient multi-core CPU execution even though the bulk of the computation runs on a GPU.

Zachary DeVito, PyTorch core developer and researcher at FAIR (Meta AI), describes how the GIL makes multithreaded scaling inefficient even when the bulk of computation is performed outside of Python:

In PyTorch, Python is commonly used to orchestrate ~8 GPUs and ~64 CPU threads, growing to 4k GPUs and 32k CPU threads for big models. While the heavy lifting is done outside of Python, the speed of GPUs makes even just the orchestration in Python not scalable. We often end up with 72 processes in place of one because of the GIL. Logging, debugging, and performance tuning are orders-of-magnitude more difficult in this regime, continuously causing lower developer productivity.

The use of many processes (instead of threads) makes common tasks more difficult. Zachary DeVito continues:

On three separate occasions in the past couple of months (reducing redundant compute in data loaders, writing model checkpoints asynchronously, and parallelizing compiler optimizations), I spent an order-of-magnitude more time figuring out how to work around GIL limitations than actually solving the particular problem.

Even GPU-heavy workloads frequently have a CPU-intensive component. For example, computer vision tasks typically require multiple “pre-processing” steps in the data input pipeline, like image decoding, cropping, and resizing. These tasks are commonly performed on the CPU and may use Python libraries like Pillow or Pillow-SIMD. It is necessary to run the data input pipeline on multiple CPU cores in order to keep the GPU “fed” with data.

The increase in GPU performance compared to individual CPU cores makes multi-core performance more important. It is progressively more difficult to keep the GPUs fully occupied. To do so requires efficient use of multiple CPU cores, especially on multi-GPU systems. For example, NVIDIA’s DGX-A100 has 8 GPUs and two 64-core CPUs in order to keep the GPUs “fed” with data.

The GIL Makes Deploying Python AI Models Difficult

Python is widely used to develop neural network-based AI models. In PyTorch, models are frequently deployed as part of multi-threaded, mostly C++, environments. Python is often viewed skeptically because the GIL can be a global bottleneck, preventing efficient scaling even though the vast majority of the computations occur “outside” of Python with the GIL released. The torchdeploy paper [2] shows experimental evidence for these scaling bottlenecks in multiple model architectures.

PyTorch provides a number of mechanisms for deploying Python AI models that avoid or work around the GIL, but they all come with substantial limitations. For example, TorchScript captures a representation of the model that can be executed from C++ without any Python dependencies, but it only supports a limited subset of Python and often requires rewriting some of the model’s code. The torch::deploy API allows multiple Python interpreters, each with its own GIL, in the same process(similar to PEP 684). However, torch::deploy has limited support for Python modules that use C-API extensions.

Motivation Summary

Python’s global interpreter lock makes it difficult to use modern multi-core CPUs efficiently for many scientific and numeric computing applications. Heinrich Kuttler, Manuel Kroiss, and Paweł Jurgielewicz found that multi-threaded implementations in Python did not scale well for their tasks and that using multiple processes was not a suitable alternative.

The scaling bottlenecks are not solely in core numeric tasks. Both Zachary DeVito and Paweł Jurgielewicz described challenges with coordination and communication in Python.

Olivier Grisel, Ralf Gommers, and Zachary DeVito described how current workarounds for the GIL are “complex to maintain” and cause “lower developer productivity.” The GIL makes it more difficult to develop and maintain scientific and numeric computing libraries as well leading to library designs that are more difficult to use.

以下略。

================================================

相关:

https://docs.google.com/document/d/18CXhDb1ygxg-YXNBJNzfzZsDFosB5e6BfnXLlejd9l0/edit?pli=1#heading=h.jcxfoklnvp0i

关于python的GIL的解除——PEP 703 – Making the Global Interpreter Lock Optional in CPython的更多相关文章

  1. Python GIL(Global Interpreter Lock)

    一,介绍 定义: In CPython, the global interpreter lock, or GIL, is a mutex that prevents multiple native t ...

  2. python之GIL官方文档 global interpreter lock 全局解释器锁

    0.目录 2. 术语 global interpreter lock 全局解释器锁3. C-API 还有更多没有仔细看4. 定期切换线程5. wiki.python6. python.doc FAQ ...

  3. Python GIL(Global Interpreter Lock)

    一.介绍 In CPython, the global interpreter lock, or GIL, is a mutex that prevents multiple native threa ...

  4. python之GIL(Global Interpreter Lock)

    一 介绍 ''' 定义: In CPython, the global interpreter lock, or GIL, is a mutex that prevents multiple nati ...

  5. python GIL(Global Interpreter Lock)

    一 介绍 ''' 定义: In CPython, the global interpreter lock, or GIL, is a mutex that prevents multiple nati ...

  6. Python解释器是单线程应用 IO 密集型 计算密集型 GIL global interpreter lock

    [Python解释器是单线程应用] [任意时刻,仅执行一个线程] 尽管Python解释器中可以运行多个线程,但是在任意给定的时刻只有一个线程会被解释器执行. [GIL锁 保证同时只有一个线程运行] 对 ...

  7. Python3 GIL(Global Interpreter Lock)与多线程

    GIL(Global Interpreter Lock)与多线程 GIL介绍 GIL与Lock GIL与多线程 多线程性能测试 在Cpython解释器中,同一个进程下开启的多线程,同一时刻只能有一个线 ...

  8. 基于Cpython的 GIL(Global Interpreter Lock)

    一 介绍 定义: In CPython, the global interpreter lock, or GIL, is a mutex that prevents multiple native t ...

  9. GIL - global interpreter lock

    python是一个解释型语言,但是可以使用多个解释器.比如C++,但是可以用不同的编译器来编译成可执行代码.有名的编译器例如GCC,INTEL C++,Visual C++等.Python也一样,同样 ...

  10. [转载] Python的GIL是什么鬼,多线程性能究竟如何

    原文: http://cenalulu.github.io/python/gil-in-python/ GIL是什么 首先需要明确的一点是GIL并不是Python的特性,它是在实现Python解析器( ...

随机推荐

  1. Vue3:介绍

    Vue 3 相较于 Vue 2 在多个方面进行了改进和优化,主要优势包括但不限于以下几个方面: 响应式系统优化: Vue 3 引入了基于 Proxy 的响应式系统,取代了 Vue 2 中基于 Obje ...

  2. mysql GROUP_CONCAT给每个值加上单引号后再拼接

    经常使用group_concat拼接数值,但有一些中文在拼接时添加单引号会比较好, 该怎么操作呢? 可以使用如下语句,在字段前添加四个单引号和逗号,并在字段后也添加一个引号和四个单引号: 1 SELE ...

  3. vue饼图

    结果图 原型 1 <template> 2 <!-- 左右柱状图 --> 3 <div ref="rankEcharts" :style=" ...

  4. 阿里也出手了!Spring CloudAlibaba AI问世了

    写在前面 在之前的文章中我们有介绍过SpringAI这个项目.SpringAI 是Spring 官方社区项目,旨在简化 Java AI 应用程序开发, 让 Java 开发者想使用 Spring 开发普 ...

  5. python 发起PUT请求,报"Method not Allowed" 和 取返回的报文的内容

    发起请求的时候,默认使用的POST请求方式,导致发起请求,返回[405 Method not Allowed ],检查此更新接口的请求方式为PUT,更改请求方式为PUT PUT接口返回的内容,不能通过 ...

  6. 如何在Spring Boot框架下实现高效的Excel服务端导入导出?

    前言 Spring Boot是由Pivotal团队提供的全新框架,其设计目的是用来简化新Spring应用的初始搭建以及开发过程.该框架使用了特定的方式来进行配置,从而使开发人员不再需要定义样板化的配置 ...

  7. Notepad++ 搭建简单Java编译运行环境

    简介 有时候使用Eclips进行Java相关方法的测试和验证太繁琐,经过查询实践,使用了Notepad++和JDK搭建了一个简单的编译运行环境. 搭建过程 在电脑上安装Java环境(网上教程很多,此过 ...

  8. 浙江省赛决赛 misc2 蝎子

    Misc 2 tcp.stream eq 0 内得知是冰蝎3.0,key是e46023a69f8db309 <?php @error_reporting(0); session_start(); ...

  9. Servlet之Request和Response的快速上手

    阅读提示: 前置内容 MyBatis知识点总结 HTTP和Servlet入门 目录 1.Request和Response概述 2.Request对象 2.1 Request继承体系 2.2 Reque ...

  10. C# 线程与进程

    一.前台线程与后台线程对象 为什么要用多线程? 1.让计算机"同时"做多件事情,节约时间. 2.多线程可以让一个程序"同时"处理多个事情. 3.后台运行程序,提 ...