Analysis of balance_dirty_pages_ratelimited

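  • nr_dirtied_pause: the dirty-page threshold of the current task.
  • dirty_exceeded: set when the bdi's dirty page count exceeds the bdi threshold and, in addition, either the global dirty page count exceeds the global threshold or the bdi uses strictlimit: dirty_exceeded = (bdi_dirty > bdi_thresh) && ((nr_dirty > dirty_thresh) || strictlimit);
  • bdp_ratelimits: a per-CPU variable counting the pages dirtied on the current CPU.
  • ratelimit_pages: the per-CPU dirty-page threshold.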

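balance_dirty_pages is called when any of the following conditions holds:

1: The number of pages dirtied by the current task reaches ratelimit (current->nr_dirtied_pause if dirty_exceeded is 0; capped at 32KB if dirty_exceeded is 1).

2: The number of pages dirtied on the current CPU reaches the threshold ratelimit_pages.

3: The current task's dirty pages plus the dirty pages left behind by exited tasks reach the threshold.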
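The 128KB and 32KB figures above are stored as page counts, obtained with a shift of the form `kb >> (PAGE_SHIFT - 10)`. The following is a minimal sketch of that arithmetic, assuming 4KB pages (PAGE_SHIFT = 12). `kb_to_pages` and `effective_ratelimit` are hypothetical helper names used only for illustration; they are not kernel functions.

```c
#include <assert.h>

#define PAGE_SHIFT 12 /* assumption: 4KB pages */

/* Hypothetical helper: convert a size in KB to a page count,
 * the same way the kernel expression `kb >> (PAGE_SHIFT - 10)` does. */
static int kb_to_pages(int kb)
{
	assert(kb >= 0);
	return kb >> (PAGE_SHIFT - 10);
}

/* Hypothetical helper: the per-task threshold that is compared against
 * current->nr_dirtied. It is nr_dirtied_pause normally, and is capped at
 * 32KB worth of pages once dirty_exceeded is set. */
static int effective_ratelimit(int nr_dirtied_pause, int dirty_exceeded)
{
	int cap = kb_to_pages(32);

	if (dirty_exceeded && cap < nr_dirtied_pause)
		return cap;
	return nr_dirtied_pause;
}
```

With 4KB pages, `kb_to_pages(128)` is 32 pages and `kb_to_pages(32)` is 8 pages, so a task starting with nr_dirtied_pause = 32 is checked four times as often once dirty_exceeded is set.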

void balance_dirty_pages_ratelimited(struct address_space *mapping)
{
	struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
	int ratelimit;
	int *p;

	if (!bdi_cap_account_dirty(bdi))
		return;

	ratelimit = current->nr_dirtied_pause; /* threshold: the initial value of 32 pages means 128KB */
	if (bdi->dirty_exceeded) /* if set, lower the threshold that triggers balancing to speed up writeback */
		ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10)); /* cap the threshold at 32KB (initially 128KB) */

	preempt_disable();
	/*
	 * This prevents one CPU to accumulate too many dirtied pages without
	 * calling into balance_dirty_pages(), which can happen when there are
	 * 1000+ tasks, all of them start dirtying pages at exactly the same
	 * time, hence all honoured too large initial task->nr_dirtied_pause.
	 */
	/* i.e. balance whenever either the current task or the current CPU reaches its threshold */
	p = this_cpu_ptr(&bdp_ratelimits); /* dirty-page counter of the current CPU */
	/*
	 * The current task has reached its threshold, so the balancing below
	 * is certain to run. Restart the per-CPU count as well.
	 */
	if (unlikely(current->nr_dirtied >= ratelimit))
		*p = 0;
	/*
	 * The task is below its threshold, but the CPU has reached
	 * ratelimit_pages (default 32 pages): set the threshold to 0 so that
	 * balancing is certain to run, and restart the per-CPU count.
	 */
	else if (unlikely(*p >= ratelimit_pages)) {
		*p = 0;
		ratelimit = 0;
	}
	/*
	 * Pick up the dirtied pages by the exited tasks. This avoids lots of
	 * short-lived tasks (eg. gcc invocations in a kernel build) escaping
	 * the dirty throttling and livelock other long-run dirtiers.
	 */
	p = this_cpu_ptr(&dirty_throttle_leaks); /* pages dirtied by exited tasks are accounted here */
	if (*p > 0 && current->nr_dirtied < ratelimit) {
		unsigned long nr_pages_dirtied;
		nr_pages_dirtied = min(*p, ratelimit - current->nr_dirtied);
		*p -= nr_pages_dirtied;
		current->nr_dirtied += nr_pages_dirtied;
	}
	preempt_enable();

	if (unlikely(current->nr_dirtied >= ratelimit)) /* the current task has reached its threshold */
		balance_dirty_pages(mapping, current->nr_dirtied);
}
EXPORT_SYMBOL(balance_dirty_pages_ratelimited);
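The three trigger conditions can be exercised outside the kernel with a small model. This is only a sketch under stated assumptions: `struct model` and `should_balance` are hypothetical names, the per-CPU variables are reduced to plain integers for a single CPU, and preemption control is omitted.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical single-CPU model of the state the function reads. */
struct model {
	int nr_dirtied;      /* models current->nr_dirtied */
	int ratelimit;       /* per-task threshold after the dirty_exceeded cap */
	int cpu_dirtied;     /* models the per-CPU bdp_ratelimits counter */
	int ratelimit_pages; /* per-CPU threshold */
	int leaks;           /* models the per-CPU dirty_throttle_leaks counter */
};

/* Returns true when balance_dirty_pages() would be called. */
static bool should_balance(struct model *m)
{
	int ratelimit = m->ratelimit;

	assert(ratelimit >= 0);
	if (m->nr_dirtied >= ratelimit) {
		/* Condition 1: the task reached its own threshold. */
		m->cpu_dirtied = 0;
	} else if (m->cpu_dirtied >= m->ratelimit_pages) {
		/* Condition 2: the CPU reached its threshold; force a call. */
		m->cpu_dirtied = 0;
		ratelimit = 0;
	}
	if (m->leaks > 0 && m->nr_dirtied < ratelimit) {
		/* Condition 3: charge pages leaked by exited tasks to this task. */
		int room = ratelimit - m->nr_dirtied;
		int take = m->leaks < room ? m->leaks : room;

		m->leaks -= take;
		m->nr_dirtied += take;
	}
	return m->nr_dirtied >= ratelimit;
}
```

For example, a task with nr_dirtied = 20 and ratelimit = 32 on a CPU holding 20 leaked pages absorbs 12 of them, reaches 32, and triggers balancing, while 8 leaked pages remain for the next caller.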

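Normally dirty pages are written back by periodic and background writeback, which costs the current task no time. However, when dirty > dirty_freerun_ceiling(thresh, bg_thresh), that is, when the number of dirty pages exceeds the average of the direct-writeback threshold and the background-writeback threshold, (thresh + bg_thresh) / 2, the current task has to sleep for a while so that the flusher thread can do its work.

When dirty <= dirty_freerun_ceiling(thresh, bg_thresh), nr_dirtied_pause is still adjusted dynamically so that writeback is triggered at a suitable point. The adjustment policy is: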

static unsigned long dirty_poll_interval(unsigned long dirty,
					 unsigned long thresh)
{
	if (thresh > dirty)
		return 1UL << (ilog2(thresh - dirty) >> 1);

	return 1; /* dirty pages have reached the threshold: balance after every single page */
}
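The expression `1UL << (ilog2(gap) >> 1)` is an integer approximation of the square root of the gap, rounded down to a power of two. Below is a userspace sketch of the same computation; `ilog2_ul` is a hypothetical stand-in for the kernel's `ilog2()`, and the page counts in the note assume 4KB pages.

```c
#include <assert.h>

/* Hypothetical stand-in for the kernel's ilog2(): floor(log2(v)) for v > 0. */
static int ilog2_ul(unsigned long v)
{
	int log = 0;

	assert(v > 0);
	while (v >>= 1)
		log++;
	return log;
}

/* Same logic as the kernel function, using the helper above. */
static unsigned long dirty_poll_interval(unsigned long dirty,
					 unsigned long thresh)
{
	if (thresh > dirty)
		return 1UL << (ilog2_ul(thresh - dirty) >> 1);

	return 1;
}
```

A gap of 256 pages (1MB) gives an interval of 16 pages and a gap of 262144 pages (1GB) gives 512 pages. A gap of 512 pages (2MB) also gives 16 rather than the exact square root of about 22, because halving the logarithm discards the odd bit.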

As for why it is done this way, see the following explanation:

/*
Ideally if we know there are N dirtiers, it's safe to let each task
poll at (thresh-dirty)/N without exceeding the dirty limit.

However we neither know the current N, nor is sure whether it will
rush high at next second. So sqrt is used to tolerate larger N on
increased (thresh-dirty) gap:

irb> 0.upto(10) { |i| mb=2**i; pages=mb<<(20-12); printf "%4d\t%4d\n", mb, Math.sqrt(pages)}
   1      16
   2      22
   4      32
   8      45
  16      64
  32      90
  64     128
 128     181
 256     256
 512     362
1024     512

The above table means, given 1MB (or 1GB) gap and the dd tasks polling
balance_dirty_pages() on every 16 (or 512) pages, the dirty limit
won't be exceeded as long as there are less than 16 (or 512) concurrent
dd's.

Note that dirty_poll_interval() will mainly be used when (dirty < freerun).
When the dirty pages are floating in range [freerun, limit],
"[PATCH 14/18] writeback: control dirty pause time" will independently
adjust tsk->nr_dirtied_pause to get suitable pause time.

So the sqrt naturally leads to less overheads and more N tolerance for
large memory servers, which have large (thresh-freerun) gaps.
*/

void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
{
	/* dirtyable memory is not all system memory, but free pages + reclaimable pages (file pages) */
	const unsigned long available_memory = global_dirtyable_memory();
	unsigned long background;
	unsigned long dirty;
	struct task_struct *tsk;

	if (vm_dirty_bytes)
		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
	else
		dirty = (vm_dirty_ratio * available_memory) / 100;

	if (dirty_background_bytes)
		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
	else
		background = (dirty_background_ratio * available_memory) / 100;

	if (background >= dirty)
		background = dirty / 2;
	tsk = current;
	/* for tasks with PF_LESS_THROTTLE set and for real-time tasks, raise both thresholds by 1/4 */
	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
		background += background / 4;
		dirty += dirty / 4;
	}
	*pbackground = background;
	*pdirty = dirty;
	trace_global_dirty_state(background, dirty);
}

static unsigned long global_dirtyable_memory(void)
{
	unsigned long x;

	/* dirtyable memory is not all system memory, but free pages + file pages */
	x = global_page_state(NR_FREE_PAGES);
	x -= min(x, dirty_balance_reserve);

	x += global_page_state(NR_INACTIVE_FILE);
	x += global_page_state(NR_ACTIVE_FILE);

	if (!vm_highmem_is_dirtyable)
		x -= highmem_dirtyable_memory(x);

	return x + 1;	/* Ensure that we never return 0 */
}
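To make the threshold arithmetic concrete, here is a sketch of the ratio-based branch only. It assumes vm_dirty_bytes and dirty_background_bytes are both 0; `struct dirty_limits` and `ratio_limits` are hypothetical names, and the `less_throttle` flag stands in for the PF_LESS_THROTTLE / rt_task() check.

```c
#include <assert.h>

/* Hypothetical result type: both thresholds, in pages. */
struct dirty_limits {
	unsigned long background;
	unsigned long dirty;
};

/* Hypothetical helper modelling the ratio branch of global_dirty_limits(). */
static struct dirty_limits ratio_limits(unsigned long available_pages,
					unsigned int dirty_ratio,
					unsigned int background_ratio,
					int less_throttle)
{
	struct dirty_limits l;

	assert(dirty_ratio <= 100 && background_ratio <= 100);
	l.dirty = dirty_ratio * available_pages / 100;
	l.background = background_ratio * available_pages / 100;

	/* The background threshold must stay below the dirty threshold. */
	if (l.background >= l.dirty)
		l.background = l.dirty / 2;

	/* Privileged dirtiers get 25% more headroom on both thresholds. */
	if (less_throttle) {
		l.background += l.background / 4;
		l.dirty += l.dirty / 4;
	}
	return l;
}
```

With 1,000,000 dirtyable pages and ratios of 20 and 10 (assumed here to be the usual sysctl settings), the thresholds are 200,000 and 100,000 pages, or 250,000 and 125,000 for a less-throttled task.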

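1: If the number of reclaimable dirty pages plus pages under writeback is at or below the average of the background threshold and the dirty threshold, nothing is started this time; otherwise background writeback is started.

2: If the number of reclaimable dirty pages exceeds the background threshold, background writeback is triggered.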
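The two checks above reduce to simple comparisons. The sketch below uses the (thresh + bg_thresh) / 2 definition of dirty_freerun_ceiling quoted in this article; `in_freerun` and `needs_background_writeback` are hypothetical helper names, and the assertion that bg_thresh does not exceed thresh is an assumption of this model.

```c
#include <assert.h>
#include <stdbool.h>

/* Midpoint between the dirty threshold and the background threshold. */
static unsigned long dirty_freerun_ceiling(unsigned long thresh,
					   unsigned long bg_thresh)
{
	return (thresh + bg_thresh) / 2;
}

/* Hypothetical helper: true when the task is not throttled at all. */
static bool in_freerun(unsigned long dirty, unsigned long thresh,
		       unsigned long bg_thresh)
{
	assert(bg_thresh <= thresh); /* model assumption */
	return dirty <= dirty_freerun_ceiling(thresh, bg_thresh);
}

/* Hypothetical helper: the final check that kicks background writeback. */
static bool needs_background_writeback(unsigned long nr_reclaimable,
				       unsigned long background_thresh)
{
	return nr_reclaimable > background_thresh;
}
```

With thresholds of 200,000 and 100,000 pages the freerun ceiling is 150,000 pages, so a task sees no throttling until the dirty count passes that midpoint.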

static void balance_dirty_pages(struct address_space *mapping,
				unsigned long pages_dirtied)
{
	unsigned long nr_reclaimable;	/* = file_dirty + unstable_nfs */
	unsigned long nr_dirty;	/* = file_dirty + writeback + unstable_nfs */
	unsigned long background_thresh;
	unsigned long dirty_thresh;
	long period;
	long pause;
	long max_pause;
	long min_pause;
	int nr_dirtied_pause;
	bool dirty_exceeded = false;
	unsigned long task_ratelimit;
	unsigned long dirty_ratelimit;
	unsigned long pos_ratio;
	struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
	bool strictlimit = bdi->capabilities & BDI_CAP_STRICTLIMIT; /* throttle against this bdi's own limits */
	unsigned long start_time = jiffies;

	for (;;) {
		unsigned long now = jiffies;
		unsigned long uninitialized_var(bdi_thresh);
		unsigned long thresh;
		unsigned long uninitialized_var(bdi_dirty);
		unsigned long dirty;
		unsigned long bg_thresh;

		/*
		 * Unstable writes are a feature of certain networked
		 * filesystems (i.e. NFS) in which data may have been
		 * written to the server's write cache, but has not yet
		 * been flushed to permanent storage.
		 */
		/* global: dirty file pages + unstable NFS pages */
		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
					global_page_state(NR_UNSTABLE_NFS);
		/* global: all dirty file pages, including those under writeback */
		nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);

		global_dirty_limits(&background_thresh, &dirty_thresh); /* fetch the two thresholds */

		if (unlikely(strictlimit)) { /* per-bdi limits */
			bdi_dirty_limits(bdi, dirty_thresh, background_thresh,
					 &bdi_dirty, &bdi_thresh, &bg_thresh);

			dirty = bdi_dirty;
			thresh = bdi_thresh;
		} else { /* global limits */
			dirty = nr_dirty; /* global: all dirty file pages, including those under writeback */
			thresh = dirty_thresh;
			bg_thresh = background_thresh;
		}

		/*
		 * Throttle it only when the background writeback cannot
		 * catch-up. This avoids (excessively) small writeouts
		 * when the bdi limits are ramping up in case of !strictlimit.
		 *
		 * In strictlimit case make decision based on the bdi counters
		 * and limits. Small writeouts when the bdi limits are ramping
		 * up are the price we consciously pay for strictlimit-ing.
		 */
		/*
		 * At or below the average of the dirty and background
		 * thresholds, the task is not delayed; above it, background
		 * writeback is not keeping up and the task has to pay.
		 */
		if (dirty <= dirty_freerun_ceiling(thresh, bg_thresh)) { /* (thresh + bg_thresh) / 2: no throttling */
			current->dirty_paused_when = now;
			current->nr_dirtied = 0; /* reset the task's dirty-page count */
			current->nr_dirtied_pause =
				dirty_poll_interval(dirty, thresh); /* recompute the task's dirty-page threshold */
			break;
		}

		if (unlikely(!writeback_in_progress(bdi))) /* wake up the flusher thread that does the real writeback */
			bdi_start_background_writeback(bdi);

		if (!strictlimit)
			bdi_dirty_limits(bdi, dirty_thresh, background_thresh,
					 &bdi_dirty, &bdi_thresh, NULL);
		/*
		 * With strictlimit, the bdi counts as over the limit as soon
		 * as its own dirty pages exceed its threshold; otherwise the
		 * bdi must exceed its threshold and the system-wide dirty
		 * pages must exceed the global threshold as well.
		 */
		dirty_exceeded = (bdi_dirty > bdi_thresh) &&
				 ((nr_dirty > dirty_thresh) || strictlimit); /* over the limit */
		if (dirty_exceeded && !bdi->dirty_exceeded)
			bdi->dirty_exceeded = 1; /* over the limit: balancing is triggered more often from now on */

		bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
				     nr_dirty, bdi_thresh, bdi_dirty,
				     start_time);

		dirty_ratelimit = bdi->dirty_ratelimit;
		pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
					       background_thresh, nr_dirty,
					       bdi_thresh, bdi_dirty);
		task_ratelimit = ((u64)dirty_ratelimit * pos_ratio) >>
							RATELIMIT_CALC_SHIFT;
		max_pause = bdi_max_pause(bdi, bdi_dirty);
		min_pause = bdi_min_pause(bdi, max_pause,
					  task_ratelimit, dirty_ratelimit,
					  &nr_dirtied_pause);

		if (unlikely(task_ratelimit == 0)) {
			period = max_pause;
			pause = max_pause;
			goto pause;
		}
		period = HZ * pages_dirtied / task_ratelimit;
		pause = period;
		if (current->dirty_paused_when)
			pause -= now - current->dirty_paused_when;
		/*
		 * For less than 1s think time (ext3/4 may block the dirtier
		 * for up to 800ms from time to time on 1-HDD; so does xfs,
		 * however at much less frequency), try to compensate it in
		 * future periods by updating the virtual time; otherwise just
		 * do a reset, as it may be a light dirtier.
		 */
		if (pause < min_pause) {
			trace_balance_dirty_pages(bdi,
						  dirty_thresh,
						  background_thresh,
						  nr_dirty,
						  bdi_thresh,
						  bdi_dirty,
						  dirty_ratelimit,
						  task_ratelimit,
						  pages_dirtied,
						  period,
						  min(pause, 0L),
						  start_time);
			if (pause < -HZ) {
				current->dirty_paused_when = now;
				current->nr_dirtied = 0;
			} else if (period) {
				current->dirty_paused_when += period;
				current->nr_dirtied = 0;
			} else if (current->nr_dirtied_pause <= pages_dirtied)
				current->nr_dirtied_pause += pages_dirtied;
			break;
		}
		if (unlikely(pause > max_pause)) {
			/* for occasional dropped task_ratelimit */
			now += min(pause - max_pause, max_pause);
			pause = max_pause;
		}

pause:
		trace_balance_dirty_pages(bdi,
					  dirty_thresh,
					  background_thresh,
					  nr_dirty,
					  bdi_thresh,
					  bdi_dirty,
					  dirty_ratelimit,
					  task_ratelimit,
					  pages_dirtied,
					  period,
					  pause,
					  start_time);
		__set_current_state(TASK_KILLABLE);
		io_schedule_timeout(pause); /* the task may be scheduled out here, but for no more than 200ms */

		current->dirty_paused_when = now + pause;
		current->nr_dirtied = 0;
		current->nr_dirtied_pause = nr_dirtied_pause;

		/*
		 * This is typically equal to (nr_dirty < dirty_thresh) and can
		 * also keep "1000+ dd on a slow USB stick" under control.
		 */
		if (task_ratelimit)
			break;

		/*
		 * In the case of an unresponding NFS server and the NFS dirty
		 * pages exceeds dirty_thresh, give the other good bdi's a pipe
		 * to go through, so that tasks on them still remain responsive.
		 *
		 * In theory 1 page is enough to keep the comsumer-producer
		 * pipe going: the flusher cleans 1 page => the task dirties 1
		 * more page. However bdi_dirty has accounting errors. So use
		 * the larger and more IO friendly bdi_stat_error.
		 */
		if (bdi_dirty <= bdi_stat_error(bdi))
			break;

		if (fatal_signal_pending(current))
			break;
	}

	if (!dirty_exceeded && bdi->dirty_exceeded) /* no longer over the limit: clear the flag */
		bdi->dirty_exceeded = 0;

	if (writeback_in_progress(bdi)) /* writeback is already running: nothing more to do */
		return;

	/*
	 * In laptop mode, we wait until hitting the higher threshold before
	 * starting background writeout, and then write out all the way down
	 * to the lower threshold. So slow writers cause minimal disk activity.
	 *
	 * In normal mode, we start background writeout at the lower
	 * background_thresh, to keep the amount of dirty memory low.
	 */
	/*
	 * Laptop (power-saving) mode: as the comment above says, background
	 * writeout is not started at the lower threshold here.
	 */
	if (laptop_mode)
		return;

	if (nr_reclaimable > background_thresh) /* reclaimable pages exceed background_thresh: kick asynchronous writeback by the flusher thread */
		bdi_start_background_writeback(bdi);
}
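The length of the sleep follows from period = HZ * pages_dirtied / task_ratelimit, reduced by the time the task has already spent since its last pause and clamped to max_pause. The sketch below models only that arithmetic. `compute_pause` is a hypothetical helper, HZ = 1000 is an assumption chosen so that one jiffy equals one millisecond, and `think_time` stands for now - current->dirty_paused_when (pass 0 when dirty_paused_when is unset).

```c
#include <assert.h>

#define HZ 1000 /* assumption: 1000 jiffies per second, so 1 jiffy = 1 ms */

/*
 * Hypothetical helper modelling the pause chosen inside the loop.
 * Returns the number of jiffies to sleep, or 0 when the task leaves the
 * loop without sleeping because the pause would be shorter than min_pause.
 */
static long compute_pause(unsigned long pages_dirtied,
			  unsigned long task_ratelimit,
			  long think_time, long min_pause, long max_pause)
{
	long period, pause;

	assert(min_pause >= 0 && max_pause >= min_pause);

	/* A zero rate means the task may not dirty anything: sleep the maximum. */
	if (task_ratelimit == 0)
		return max_pause;

	/* Time the task "owes" for the pages it dirtied at the allowed rate. */
	period = (long)(HZ * pages_dirtied / task_ratelimit);
	pause = period - think_time;

	if (pause < min_pause)
		return 0;
	if (pause > max_pause)
		pause = max_pause;
	return pause;
}
```

For instance, a task that dirtied 32 pages with an allowed rate of 3200 pages per second owes 10 ms. If it has already spent 15 ms elsewhere since its last pause it does not sleep at all, and at a rate of only 32 pages per second the 1000 ms period is clamped to a max_pause of 200 ms.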
