On removing Python's GIL — PEP 703 – Making the Global Interpreter Lock Optional in CPython
PEP link:
https://peps.python.org/pep-0703/
PEP 703 – Making the Global Interpreter Lock Optional in CPython
================================================
Abstract
CPython’s global interpreter lock (“GIL”) prevents multiple threads from executing Python code at the same time. The GIL is an obstacle to using multi-core CPUs from Python efficiently. This PEP proposes adding a build configuration (--disable-gil) to CPython to let it run Python code without the global interpreter lock and with the necessary changes needed to make the interpreter thread-safe.
Motivation
The GIL is a major obstacle to concurrency. For scientific computing tasks, this lack of concurrency is often a bigger issue than speed of executing Python code, since most of the processor cycles are spent in optimized CPU or GPU kernels. The GIL introduces a global bottleneck that can prevent other threads from making progress if they call any Python code. There are existing ways to enable parallelism in CPython today, but those techniques come with significant limitations (see Alternatives).
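This bottleneck is easy to demonstrate. Below is a minimal sketch (the function and thread count are illustrative, not from the PEP) that times a CPU-bound loop run serially and then across threads; under the GIL the threaded version gains little or nothing:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def count_down(n: int) -> None:
    # Pure-Python CPU-bound loop; it holds the GIL the entire time.
    while n > 0:
        n -= 1

N = 10_000_000

# Serial baseline: four calls, one after another.
start = time.perf_counter()
for _ in range(4):
    count_down(N)
serial = time.perf_counter() - start

# Threaded version: the same four calls across four threads.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    for _ in range(4):
        pool.submit(count_down, N)
threaded = time.perf_counter() - start

# With the GIL, the threaded time is roughly equal to (or worse than) the
# serial time; without it, the work could spread across four cores.
print(f"serial: {serial:.2f}s  threaded: {threaded:.2f}s")
```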
This section focuses on the GIL’s impact on scientific computing, particularly AI/ML workloads, because that is the area with which this author has the most experience, but the GIL also affects other users of Python.
The GIL Makes Many Types of Parallelism Difficult to Express
Neural network-based AI models expose multiple opportunities for parallelism. For example, individual operations may be parallelized internally (“intra-operator”), multiple operations may be executed simultaneously (“inter-operator”), and requests (spanning multiple operations) may also be parallelized. Efficient execution requires exploiting multiple types of parallelism [1].
The GIL makes it difficult to express inter-operator parallelism, as well as some forms of request parallelism, efficiently in Python. In other programming languages, a system might use threads to run different parts of a neural network on separate CPU cores, but this is inefficient in Python due to the GIL. Similarly, latency-sensitive inference workloads frequently use threads to parallelize across requests, but face the same scaling bottlenecks in Python.
The challenges the GIL poses to exploiting parallelism in Python frequently come up in reinforcement learning. Heinrich Kuttler, author of the NetHack Learning Environment and Member of Technical Staff at Inflection AI, writes:
Recent breakthroughs in reinforcement learning, such as on Dota 2, StarCraft, and NetHack rely on running multiple environments (simulated games) in parallel using asynchronous actor-critic methods. Straightforward multithreaded implementations in Python don’t scale beyond more than a few parallel environments due to GIL contention. Multiprocessing, with communication via shared memory or UNIX sockets, adds much complexity and in effect rules out interacting with CUDA from different workers, severely restricting the design space.
Manuel Kroiss, software engineer at DeepMind on the reinforcement learning team, describes how the bottlenecks posed by the GIL lead to rewriting Python codebases in C++, making the code less accessible:
We frequently battle issues with the Python GIL at DeepMind. In many of our applications, we would like to run on the order of 50-100 threads per process. However, we often see that even with fewer than 10 threads the GIL becomes the bottleneck. To work around this problem, we sometimes use subprocesses, but in many cases the inter-process communication becomes too big of an overhead. To deal with the GIL, we usually end up translating large parts of our Python codebase into C++. This is undesirable because it makes the code less accessible to researchers.
Projects that involve interfacing with multiple hardware devices face similar challenges: efficient communication requires use of multiple CPU cores. The Dose-3D project aims to improve cancer radiotherapy with precise dose planning. It uses medical phantoms (stand-ins for human tissue) together with custom hardware and a server application written in Python. Paweł Jurgielewicz, lead software architect for the data acquisition system on the Dose-3D project, describes the scaling challenges posed by the GIL and how using a fork of Python without the GIL simplified the project:
In the Dose-3D project, the key challenge was to maintain a stable, non-trivial concurrent communication link with hardware units while utilizing a 1 Gbit/s UDP/IP connection to the maximum. Naturally, we started with the multiprocessing package, but at some point, it became clear that most CPU time was consumed by the data transfers between the data processing stages, not by data processing itself. The CPython multithreading implementation based on GIL was a dead end too. When we found out about the “nogil” fork of Python it took a single person less than half a working day to adjust the codebase to use this fork and the results were astonishing. Now we can focus on data acquisition system development rather than fine-tuning data exchange algorithms.
Allen Goodman, author of CellProfiler and staff engineer at Prescient Design and Genentech, describes how the GIL makes biological methods research more difficult in Python:
Issues with Python’s global interpreter lock are a frequent source of frustration throughout biological methods research. I wanted to better understand the current multithreading situation so I reimplemented parts of HMMER, a standard method for multiple-sequence alignment. I chose this method because it stresses both single-thread performance (scoring) and multi-threaded performance (searching a database of sequences). The GIL became the bottleneck when using only eight threads. This is a method where the current popular implementations rely on 64 or even 128 threads per process. I tried moving to subprocesses but was blocked by the prohibitive IPC costs. HMMER is a relatively elementary bioinformatics method and newer methods have far bigger multi-threading demands.
Method researchers are begging to use Python (myself included), because of its ease of use, the Python ecosystem, and because “it’s what people know.” Many biologists only know a little bit of programming (and that’s almost always Python). Until Python’s multithreading situation is addressed, C and C++ will remain the lingua franca of the biological methods research community.
The GIL Affects Python Library Usability
The GIL is a CPython implementation detail that limits multithreaded parallelism, so it might seem unintuitive to think of it as a usability issue. However, library authors frequently care a great deal about performance and will design APIs that support working around the GIL. These workarounds frequently lead to APIs that are more difficult to use. Consequently, users of these APIs may experience the GIL as a usability issue and not just a performance issue.
For example, PyTorch exposes a multiprocessing-based API called DataLoader for building data input pipelines. It uses fork() on Linux because it is generally faster and uses less memory than spawn(), but this leads to additional challenges for users: creating a DataLoader after accessing a GPU can lead to confusing CUDA errors. Accessing GPUs within a DataLoader worker quickly leads to out-of-memory errors because processes do not share CUDA contexts (unlike threads within a process).
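As a concrete illustration of one such workaround (the dataset class below is hypothetical), DataLoader accepts a multiprocessing_context argument, so users who hit fork-related CUDA errors typically switch their workers to spawn, trading away startup speed:

```python
import torch
from torch.utils.data import DataLoader, Dataset

class RangeDataset(Dataset):
    """Hypothetical toy dataset standing in for a real input pipeline."""

    def __len__(self) -> int:
        return 1024

    def __getitem__(self, idx: int) -> torch.Tensor:
        return torch.tensor([float(idx)])

if __name__ == "__main__":
    # "spawn" avoids fork()-after-CUDA crashes, at the cost of slower worker
    # startup and the requirement that this module be import-safe: part of
    # the usability tax the GIL imposes by pushing libraries to multiprocessing.
    loader = DataLoader(
        RangeDataset(),
        batch_size=32,
        num_workers=4,
        multiprocessing_context="spawn",
    )
    for batch in loader:
        pass
```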
Olivier Grisel, scikit-learn developer and software engineer at Inria, describes how having to work around the GIL in scikit-learn related libraries leads to a more complex and confusing user experience:
Over the years, scikit-learn developers have maintained ancillary libraries such as joblib and loky to try to work around some of the limitations of multiprocessing: extra memory usage partially mitigated via semi-automated memory mapping of large data buffers, slow worker startup by transparently reusing a pool of long running workers, fork-safety problems of third-party native runtime libraries such as GNU OpenMP by never using the fork-only start-method, ability to perform parallel calls of interactively defined functions in notebooks and REPLs in cross-platform manner via cloudpickle. Despite our efforts, this multiprocessing-based solution is still brittle, complex to maintain and confusing to data scientists with limited understanding of system-level constraints. Furthermore, there are still irreducible limitations such as the overhead caused by the pickle-based serialization/deserialization steps required for inter-process communication. A lot of this extra work and complexity would not be needed anymore if we could use threads without contention on multicore hosts (sometimes with 64 physical cores or more) to run data science pipelines that alternate between Python-level operations and calls to native libraries.
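For context, a minimal joblib sketch (the worker function is illustrative): the loky backend transparently reuses a pool of worker processes and pickles arguments and results across process boundaries, which is exactly the machinery and serialization overhead Grisel describes:

```python
from joblib import Parallel, delayed

def fit_fold(seed: int) -> int:
    # Stand-in for a CPU-bound model-fitting step.
    return sum(i * seed for i in range(1_000_000)) % 997

# loky spawns (and reuses) worker processes; each call's inputs and outputs
# are serialized for inter-process communication.
scores = Parallel(n_jobs=4, backend="loky")(
    delayed(fit_fold)(seed) for seed in range(8)
)
print(scores)
```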
Ralf Gommers, co-director of Quansight Labs and NumPy and SciPy maintainer, describes how the GIL affects the user experience of NumPy and numeric Python libraries:
A key problem in NumPy and the stack of packages built around it is that NumPy is still (mostly) single-threaded — and that has shaped significant parts of the user experience and projects built around it. NumPy does release the GIL in its inner loops (which do the heavy lifting), but that is not nearly enough. NumPy doesn’t offer a solution to utilize all CPU cores of a single machine well, and instead leaves that to Dask and other multiprocessing solutions. Those aren’t very efficient and are also more clumsy to use. That clumsiness comes mainly in the extra abstractions and layers the users need to concern themselves with when using, e.g., dask.array which wraps numpy.ndarray. It also shows up in oversubscription issues that the user must explicitly be aware of and manage via either environment variables or a third package, threadpoolctl. The main reason is that NumPy calls into BLAS for linear algebra - and those calls it has no control over, they do use all cores by default via either pthreads or OpenMP. Coordinating on APIs and design decisions to control parallelism is still a major amount of work, and one of the harder challenges across the PyData ecosystem. It would have looked a lot different (better, easier) without a GIL.
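The oversubscription management Gommers mentions looks roughly like this in practice (a sketch; the matrix sizes are arbitrary): threadpoolctl temporarily caps the number of threads the underlying BLAS library spins up:

```python
import numpy as np
from threadpoolctl import threadpool_limits

a = np.random.rand(2000, 2000)
b = np.random.rand(2000, 2000)

# Without a limit, the BLAS library behind the matmul may use every core via
# pthreads or OpenMP; nested inside another parallel layer, this oversubscribes
# the machine. threadpool_limits caps it for the duration of the block.
with threadpool_limits(limits=1, user_api="blas"):
    c = a @ b
print(c.shape)
```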
GPU-Heavy Workloads Require Multi-Core Processing
Many high-performance computing (HPC) and AI workloads make heavy use of GPUs. These applications frequently require efficient multi-core CPU execution even though the bulk of the computation runs on a GPU.
Zachary DeVito, PyTorch core developer and researcher at FAIR (Meta AI), describes how the GIL makes multithreaded scaling inefficient even when the bulk of computation is performed outside of Python:
In PyTorch, Python is commonly used to orchestrate ~8 GPUs and ~64 CPU threads, growing to 4k GPUs and 32k CPU threads for big models. While the heavy lifting is done outside of Python, the speed of GPUs makes even just the orchestration in Python not scalable. We often end up with 72 processes in place of one because of the GIL. Logging, debugging, and performance tuning are orders-of-magnitude more difficult in this regime, continuously causing lower developer productivity.
The use of many processes (instead of threads) makes common tasks more difficult. Zachary DeVito continues:
On three separate occasions in the past couple of months (reducing redundant compute in data loaders, writing model checkpoints asynchronously, and parallelizing compiler optimizations), I spent an order-of-magnitude more time figuring out how to work around GIL limitations than actually solving the particular problem.
Even GPU-heavy workloads frequently have a CPU-intensive component. For example, computer vision tasks typically require multiple “pre-processing” steps in the data input pipeline, like image decoding, cropping, and resizing. These tasks are commonly performed on the CPU and may use Python libraries like Pillow or Pillow-SIMD. It is necessary to run the data input pipeline on multiple CPU cores in order to keep the GPU “fed” with data.
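A sketch of such a pipeline (the file paths and target size are placeholders): Pillow releases the GIL inside some decode and resize operations, but the Python-level glue between them still serializes on the GIL, which is why this pattern stops scaling well past a few threads:

```python
from concurrent.futures import ThreadPoolExecutor
from PIL import Image

def preprocess(path: str) -> Image.Image:
    # Decode, convert, and resize: the CPU-side work that keeps a GPU fed.
    with Image.open(path) as img:
        img = img.convert("RGB")
        return img.resize((224, 224))

paths = [f"images/{i:05d}.jpg" for i in range(256)]  # placeholder paths

with ThreadPoolExecutor(max_workers=8) as pool:
    batch = list(pool.map(preprocess, paths))
```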
The increase in GPU performance compared to individual CPU cores makes multi-core performance more important. It is progressively more difficult to keep the GPUs fully occupied. To do so requires efficient use of multiple CPU cores, especially on multi-GPU systems. For example, NVIDIA’s DGX-A100 has 8 GPUs and two 64-core CPUs in order to keep the GPUs “fed” with data.
The GIL Makes Deploying Python AI Models Difficult
Python is widely used to develop neural network-based AI models. In PyTorch, models are frequently deployed as part of multi-threaded, mostly C++, environments. Python is often viewed skeptically because the GIL can be a global bottleneck, preventing efficient scaling even though the vast majority of the computations occur “outside” of Python with the GIL released. The torchdeploy paper [2] shows experimental evidence for these scaling bottlenecks in multiple model architectures.
PyTorch provides a number of mechanisms for deploying Python AI models that avoid or work around the GIL, but they all come with substantial limitations. For example, TorchScript captures a representation of the model that can be executed from C++ without any Python dependencies, but it only supports a limited subset of Python and often requires rewriting some of the model’s code. The torch::deploy API allows multiple Python interpreters, each with its own GIL, in the same process (similar to PEP 684). However, torch::deploy has limited support for Python modules that use C-API extensions.
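For reference, capturing a model with TorchScript is itself simple (the toy model below is illustrative); the limitation is that the compiler only accepts the TorchScript-compatible subset of Python:

```python
import torch

class TinyModel(torch.nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(x) + 1.0

# torch.jit.script compiles the model into a Python-free representation;
# the saved artifact can be loaded and run from C++ (torch::jit::load)
# without a Python interpreter, and therefore without a GIL.
scripted = torch.jit.script(TinyModel())
scripted.save("tiny_model.pt")
```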
Motivation Summary
Python’s global interpreter lock makes it difficult to use modern multi-core CPUs efficiently for many scientific and numeric computing applications. Heinrich Kuttler, Manuel Kroiss, and Paweł Jurgielewicz found that multi-threaded implementations in Python did not scale well for their tasks and that using multiple processes was not a suitable alternative.
The scaling bottlenecks are not solely in core numeric tasks. Both Zachary DeVito and Paweł Jurgielewicz described challenges with coordination and communication in Python.
Olivier Grisel, Ralf Gommers, and Zachary DeVito described how current workarounds for the GIL are “complex to maintain” and cause “lower developer productivity.” The GIL makes it more difficult to develop and maintain scientific and numeric computing libraries as well, leading to library designs that are more difficult to use.
The remainder of the PEP is omitted here.