balance_dirty_pages_ratelimited分析
balance_dirty_pages_ratelimited分析
- nr_dirtied_pause:当前task的脏页门限;
- dirty_exceeded:全局的脏页数超过门限或者该bdi的脏页数超过门限;(dirty_exceeded = (bdi_dirty > bdi_thresh) &&((nr_dirty > dirty_thresh) || strictlimit); )
- bdp_ratelimits:percpu变量,当前CPU的脏页数
- ratelimit_pages:CPU的脏页门限
调用balance_dirty_pages的条件有:
1:当前task的脏页数量大于ratelimit ,(如果dirty_exceeded为0,则为current->nr_dirtied_pause;如果dirty_exceeded为1,则最大为32KB)
2:当前CPU的脏页数超过了门限值ratelimit_pages;
3:当前脏页数+退出线程遗留的脏页超过了门限;
void balance_dirty_pages_ratelimited(struct address_space *mapping)
{
struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
int ratelimit;
int *p;
if (!bdi_cap_account_dirty(bdi))
return;
ratelimit = current->nr_dirtied_pause; /* 门限:初始值为32表示128KB */
if (bdi->dirty_exceeded) /* 如果该值设置了,则需要通过降低平衡触发的门限来加速脏页回收 */
ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10)); /* 重新修改门限,最大为32KB,初始值128KB,加快回收 */
preempt_disable();
/*
* This prevents one CPU to accumulate too many dirtied pages without
* calling into balance_dirty_pages(), which can happen when there are
* 1000+ tasks, all of them start dirtying pages at exactly the same
* time, hence all honoured too large initial task->nr_dirtied_pause.
*/
/* 即保证当前线程脏页数超过门限,或者当前CPU超过门限,都要回收 */
p = this_cpu_ptr(&bdp_ratelimits); /* 当前CPU的脏页计数 */
if (unlikely(current->nr_dirtied >= ratelimit)) /* 如果当前线程脏页数超过门限值,则肯定会触发下面的回收流程。同时重新计算当前CPU的脏页数 */
*p = 0;
else if (unlikely(*p >= ratelimit_pages)) { /* 默认值为32页 */ /* 当前线程的脏页数未超过门限值,但是当前CPU的脏页数超过CPU脏页门限值,则设置门限为0,肯定会触发回收。同时重新计算当前CPU的脏页数 */
*p = 0;
ratelimit = 0;
}
/*
* Pick up the dirtied pages by the exited tasks. This avoids lots of
* short-lived tasks (eg. gcc invocations in a kernel build) escaping
* the dirty throttling and livelock other long-run dirtiers.
*/
p = this_cpu_ptr(&dirty_throttle_leaks); /* 退出的线程,也放在这里处理 */
if (*p > 0 && current->nr_dirtied < ratelimit) {
unsigned long nr_pages_dirtied;
nr_pages_dirtied = min(*p, ratelimit - current->nr_dirtied);
*p -= nr_pages_dirtied;
current->nr_dirtied += nr_pages_dirtied;
}
preempt_enable();
if (unlikely(current->nr_dirtied >= ratelimit)) /* 当前线程脏页超过门限值 */
balance_dirty_pages(mapping, current->nr_dirtied);
}
EXPORT_SYMBOL(balance_dirty_pages_ratelimited);
正常情况下应该是周期回收和背景回收,不会占用当前task的时间。但是当dirty > dirty_freerun_ceiling(thresh, bg_thresh) 即脏页数大于直接回收门限和背景回收门限的1/2时,需要将当前CPU休眠一会,让回收线程工作。
但是dirty <= dirty_freerun_ceiling(thresh, bg_thresh),也会动态的调整nr_dirtied_pause ,号让其更好的回收,调整的策略为:
static unsigned long dirty_poll_interval(unsigned long dirty,
unsigned long thresh)
{
/* */
if (thresh > dirty) /* */
return 1UL << (ilog2(thresh - dirty) >> 1);
return 1; /* 脏页数超过门限值,则返回1页就需要回收 */
}
至于为什么这么做,可以参考如下解析:
/*
Ideally if we know there are N dirtiers, it’s safe to let each task
poll at (thresh-dirty)/N without exceeding the dirty limit.
However we neither know the current N, nor is sure whether it will
rush high at next second. So sqrt is used to tolerate larger N on
increased (thresh-dirty) gap:
irb> 0.upto(10) { |i| mb=2**i; pages=mb<<(20-12); printf “%4d\t%4d\n”, mb, Math.sqrt(pages)}
1 16
2 22
4 32
8 45
16 64
32 90
64 128
128 181
256 256
512 362
1024 512
The above table means, given 1MB (or 1GB) gap and the dd tasks polling
balance_dirty_pages() on every 16 (or 512) pages, the dirty limit
won’t be exceeded as long as there are less than 16 (or 512) concurrent
dd’s.
Note that dirty_poll_interval() will mainly be used when (dirty < freerun).
When the dirty pages are floating in range [freerun, limit],
“[PATCH 14/18] writeback: control dirty pause time” will independently
adjust tsk->nr_dirtied_pause to get suitable pause time.
So the sqrt naturally leads to less overheads and more N tolerance for
large memory servers, which have large (thresh-freerun) gaps.
*/
void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
{
/* 可用内存并不是系统所有内存,而是free pages + reclaimable pages(文件页) */
const unsigned long available_memory = global_dirtyable_memory();
unsigned long background;
unsigned long dirty;
struct task_struct *tsk;
if (vm_dirty_bytes)
dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
else
dirty = (vm_dirty_ratio * available_memory) / 100;
if (dirty_background_bytes)
background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
else
background = (dirty_background_ratio * available_memory) / 100;
if (background >= dirty)
background = dirty / 2;
tsk = current;
if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) { /* 如果设置了该属性PF_LESS_THROTTLE或者是实时线程,门限稍微提高1/4 */
background += background / 4;
dirty += dirty / 4;
}
*pbackground = background;
*pdirty = dirty;
trace_global_dirty_state(background, dirty);
}
static unsigned long global_dirtyable_memory(void)
{
unsigned long x;
/* 可用内存并不是系统所有内存,而是free pages + file pages(文件页) */
x = global_page_state(NR_FREE_PAGES);
x -= min(x, dirty_balance_reserve);
x += global_page_state(NR_INACTIVE_FILE);
x += global_page_state(NR_ACTIVE_FILE);
if (!vm_highmem_is_dirtyable)
x -= highmem_dirtyable_memory(x);
return x + 1; /* Ensure that we never return 0 */
}
1:如果可回收+正在回写脏页数量 < background和显式回写阈值的均值此次先不启动回写,否则启动background回写
2:如果可回收的脏页数大于背景回收门限值,则触发背景回收执行;
static void balance_dirty_pages(struct address_space *mapping,
unsigned long pages_dirtied)
{
unsigned long nr_reclaimable; /* = file_dirty + unstable_nfs */
unsigned long nr_dirty; /* = file_dirty + writeback + unstable_nfs */
unsigned long background_thresh;
unsigned long dirty_thresh;
long period;
long pause;
long max_pause;
long min_pause;
int nr_dirtied_pause;
bool dirty_exceeded = false;
unsigned long task_ratelimit;
unsigned long dirty_ratelimit;
unsigned long pos_ratio;
struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
bool strictlimit = bdi->capabilities & BDI_CAP_STRICTLIMIT; //单独门限值回收
unsigned long start_time = jiffies;
for (;;) {
unsigned long now = jiffies;
unsigned long uninitialized_var(bdi_thresh);
unsigned long thresh;
unsigned long uninitialized_var(bdi_dirty);
unsigned long dirty;
unsigned long bg_thresh;
/*
* Unstable writes are a feature of certain networked
* filesystems (i.e. NFS) in which data may have been
* written to the server's write cache, but has not yet
* been flushed to permanent storage.
*/
nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
global_page_state(NR_UNSTABLE_NFS); /* 全局 文件脏页 + 网络文件系统 */ /* = file_dirty + unstable_nfs */
nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK); /*全局 文件总的脏页+包括正在回写 */ /* = file_dirty + writeback + unstable_nfs */
global_dirty_limits(&background_thresh, &dirty_thresh);//获取两个门限值
if (unlikely(strictlimit)) { /* 单独bdi回收 */
bdi_dirty_limits(bdi, dirty_thresh, background_thresh,
&bdi_dirty, &bdi_thresh, &bg_thresh);
dirty = bdi_dirty;
thresh = bdi_thresh;
} else { /* 全局回收 */
dirty = nr_dirty; /* 全局 文件总的脏页+包括正在回写 */
thresh = dirty_thresh;
bg_thresh = background_thresh;
}
/*
* Throttle it only when the background writeback cannot
* catch-up. This avoids (excessively) small writeouts
* when the bdi limits are ramping up in case of !strictlimit.
*
* In strictlimit case make decision based on the bdi counters
* and limits. Small writeouts when the bdi limits are ramping
* up are the price we consciously pay for strictlimit-ing.
*/
/* 小于直接回收文件和背景回收的/2, 不占用本线程时间;否则说明背景回收没有运行,需要占用本线程时间, */
if (dirty <= dirty_freerun_ceiling(thresh, bg_thresh)) { //(thresh + bg_thresh) / 2; 不回收
current->dirty_paused_when = now;
current->nr_dirtied = 0; /* 脏页数量重新置0 */
current->nr_dirtied_pause =
dirty_poll_interval(dirty, thresh); /* 重新设置线程脏页门限 */
break;
}
if (unlikely(!writeback_in_progress(bdi))) /* 唤醒真正的回写线程 */
bdi_start_background_writeback(bdi);
if (!strictlimit)
bdi_dirty_limits(bdi, dirty_thresh, background_thresh,
&bdi_dirty, &bdi_thresh, NULL);
//nr_dirty > dirty_thresh
/*
* 如果是单个bdi独自回收,当前bdi的 脏页超过门限即回收;
* 如果是整个系统回收,当前bdi超过门限且系统的脏页也要超超过门限;
*/
dirty_exceeded = (bdi_dirty > bdi_thresh) &&
((nr_dirty > dirty_thresh) || strictlimit); //超过门限
if (dirty_exceeded && !bdi->dirty_exceeded)
bdi->dirty_exceeded = 1; //超过门限,后面需要加速回收
bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
nr_dirty, bdi_thresh, bdi_dirty,
start_time);
dirty_ratelimit = bdi->dirty_ratelimit;
pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
background_thresh, nr_dirty,
bdi_thresh, bdi_dirty);
task_ratelimit = ((u64)dirty_ratelimit * pos_ratio) >>
RATELIMIT_CALC_SHIFT;
max_pause = bdi_max_pause(bdi, bdi_dirty);
min_pause = bdi_min_pause(bdi, max_pause,
task_ratelimit, dirty_ratelimit,
&nr_dirtied_pause);
if (unlikely(task_ratelimit == 0)) {
period = max_pause;
pause = max_pause;
goto pause;
}
period = HZ * pages_dirtied / task_ratelimit;
pause = period;
if (current->dirty_paused_when)
pause -= now - current->dirty_paused_when;
/*
* For less than 1s think time (ext3/4 may block the dirtier
* for up to 800ms from time to time on 1-HDD; so does xfs,
* however at much less frequency), try to compensate it in
* future periods by updating the virtual time; otherwise just
* do a reset, as it may be a light dirtier.
*/
if (pause < min_pause) {
trace_balance_dirty_pages(bdi,
dirty_thresh,
background_thresh,
nr_dirty,
bdi_thresh,
bdi_dirty,
dirty_ratelimit,
task_ratelimit,
pages_dirtied,
period,
min(pause, 0L),
start_time);
if (pause < -HZ) {
current->dirty_paused_when = now;
current->nr_dirtied = 0;
} else if (period) {
current->dirty_paused_when += period;
current->nr_dirtied = 0;
} else if (current->nr_dirtied_pause <= pages_dirtied)
current->nr_dirtied_pause += pages_dirtied;
break;
}
if (unlikely(pause > max_pause)) {
/* for occasional dropped task_ratelimit */
now += min(pause - max_pause, max_pause);
pause = max_pause;
}
pause:
trace_balance_dirty_pages(bdi,
dirty_thresh,
background_thresh,
nr_dirty,
bdi_thresh,
bdi_dirty,
dirty_ratelimit,
task_ratelimit,
pages_dirtied,
period,
pause,
start_time);
__set_current_state(TASK_KILLABLE);
io_schedule_timeout(pause);//有可能会切出去,但最大超过200ms
current->dirty_paused_when = now + pause;
current->nr_dirtied = 0;
current->nr_dirtied_pause = nr_dirtied_pause;
/*
* This is typically equal to (nr_dirty < dirty_thresh) and can
* also keep "1000+ dd on a slow USB stick" under control.
*/
if (task_ratelimit)
break;
/*
* In the case of an unresponding NFS server and the NFS dirty
* pages exceeds dirty_thresh, give the other good bdi's a pipe
* to go through, so that tasks on them still remain responsive.
*
* In theory 1 page is enough to keep the comsumer-producer
* pipe going: the flusher cleans 1 page => the task dirties 1
* more page. However bdi_dirty has accounting errors. So use
* the larger and more IO friendly bdi_stat_error.
*/
if (bdi_dirty <= bdi_stat_error(bdi))
break;
if (fatal_signal_pending(current))
break;
}
if (!dirty_exceeded && bdi->dirty_exceeded) //如果不超过门限,则置0
bdi->dirty_exceeded = 0;
if (writeback_in_progress(bdi)) //正在回收,则退出
return;
/*
* In laptop mode, we wait until hitting the higher threshold before
* starting background writeout, and then write out all the way down
* to the lower threshold. So slow writers cause minimal disk activity.
*
* In normal mode, we start background writeout at the lower
* background_thresh, to keep the amount of dirty memory low.
*/
/*
* 节能模式,起到什么作用呢??
*/
if (laptop_mode)
return;
if (nr_reclaimable > background_thresh) //可回收的页面大于background_thresh,则触发线程异步回收
bdi_start_background_writeback(bdi);
}
balance_dirty_pages_ratelimited分析的更多相关文章
- Linux Kernel文件系统写I/O流程代码分析(一)
Linux Kernel文件系统写I/O流程代码分析(一) 在Linux VFS机制简析(二)这篇博客上介绍了struct address_space_operations里底层文件系统需要实现的操作 ...
- 用户空间缺页异常pte_handle_fault()分析--(上)【转】
转自:http://blog.csdn.net/vanbreaker/article/details/7881206 版权声明:本文为博主原创文章,未经博主允许不得转载. 前面简单的分析了内核处理用户 ...
- alias导致virtualenv异常的分析和解法
title: alias导致virtualenv异常的分析和解法 toc: true comments: true date: 2016-06-27 23:40:56 tags: [OS X, ZSH ...
- 火焰图分析openresty性能瓶颈
注:本文操作基于CentOS 系统 准备工作 用wget从https://sourceware.org/systemtap/ftp/releases/下载最新版的systemtap.tar.gz压缩包 ...
- 一起来玩echarts系列(一)------箱线图的分析与绘制
一.箱线图 Box-plot 箱线图一般被用作显示数据分散情况.具体是计算一组数据的中位数.25%分位数.75%分位数.上边界.下边界,来将数据从大到小排列,直观展示数据整体的分布情况. 大部分正常数 ...
- 应用工具 .NET Portability Analyzer 分析迁移dotnet core
大多数开发人员更喜欢一次性编写好业务逻辑代码,以后再重用这些代码.与构建不同的应用以面向多个平台相比,这种方法更加容易.如果您创建与 .NET Core 兼容的.NET 标准库,那么现在比以往任何时候 ...
- UWP中新加的数据绑定方式x:Bind分析总结
UWP中新加的数据绑定方式x:Bind分析总结 0x00 UWP中的x:Bind 由之前有过WPF开发经验,所以在学习UWP的时候直接省略了XAML.数据绑定等几个看着十分眼熟的主题.学习过程中倒是也 ...
- 查看w3wp进程占用的内存及.NET内存泄露,死锁分析
一 基础知识 在分析之前,先上一张图: 从上面可以看到,这个w3wp进程占用了376M内存,启动了54个线程. 在使用windbg查看之前,看到的进程含有 *32 字样,意思是在64位机器上已32位方 ...
- ZIP压缩算法详细分析及解压实例解释
最近自己实现了一个ZIP压缩数据的解压程序,觉得有必要把ZIP压缩格式进行一下详细总结,数据压缩是一门通信原理和计算机科学都会涉及到的学科,在通信原理中,一般称为信源编码,在计算机科学里,一般称为数据 ...
- ABP源码分析一:整体项目结构及目录
ABP是一套非常优秀的web应用程序架构,适合用来搭建集中式架构的web应用程序. 整个Abp的Infrastructure是以Abp这个package为核心模块(core)+15个模块(module ...
随机推荐
- RHCA cl210 016 流表 overlay
Overlay网络是建立在Underlay网络上的逻辑网络 underlay br-int 之间建立隧道 数据流量还是从eth1出去 只有vlan20 是geneve隧道.只有租户网络有子网,子网需要 ...
- 基于禅道数据库对bug进行不同维度统计
工作中经常需要在周报.月报.年报对禅道bug数据进行不同维度统计导出,以下是我常用的统计sql 1.统计2022年每个月bug数(deleted='0'是查询未删除的bug) select DATE_ ...
- Windows安装虚拟机软件-VirtualBox
1.VirtualBox简介 VirtualBox号称是最强的开源免费虚拟机软件,它不仅具有丰富的特色,而且性能也很优异. 它简单易用,可虚拟的系统包括Windows.Mac OS X.Linux.O ...
- 【TypeScript】01 基础入门
前提:使用TypeScript你需要安装NodeJS支持 然后安装TypeScript: npm intsall -g typescript 安装完成后查看版本号: tsc -v 新建一个TypeSc ...
- 【转载】手动DIY制作机械臂
相关链接: https://news.cnblogs.com/n/703664/ https://www.bilibili.com/video/BV12341117rG https://www.cnb ...
- 结合实例看 maven 传递依赖与优先级,难顶也得上丫
开心一刻 想买摩托车了,但是钱不够,想找老爸借点 我:老爸,我想买一辆摩托车,上下班也方便 老爸:你表哥上个月骑摩托车摔走了,你不知道?还要买摩托车? 我:对不起,我不买了 老板:就是啊,骑你表哥那辆 ...
- Apache DolphinScheduler数仓任务管理规范
前言: 大数据领域对多种任务都有调度需求,以离线数仓的任务应用最多,许多团队在调研开源产品后,选择Apache DolphinScheduler(以下简称DS)作为调度场景的技术选型.得益于DS优秀的 ...
- 为什么大部分的 PHP 程序员转不了 Go 语言?
大家好,我是码农先森. 树挪死,人挪活,这个需求我做不了,换个人吧.大家都有过这种经历吧,放在编程语言身上就是 PHP 不行了,赶紧转 Go 语言吧.那转 Go 语言就真的行了?那可不见得,我个人认为 ...
- Synology NAS GitLab 配置
安装 安装的时候会提示服务器名.root用户名等,这步服务器名千万不要写错,不然会登不上去,提示 502. root 密码 网上有很多说 root 密码怎么获取的,但是都不适用. 实际上是第一个访问 ...
- SMCA:港中文提出注意力图校准的DETR加速方案 | ICCV 2021
为了加速DETR收敛,论文提出了简单而有效的Spatially Modulated Co-Attention(SMCA)机制,通过在初始边界框位置给予较高的协同注意力响应值的约束来构建DETR的回归感 ...