Analysis of balance_dirty_pages_ratelimited

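nr_dirtied_pause: the dirty-page threshold of the current task;
dirty_exceeded: set when this bdi's dirty pages exceed the bdi threshold and, in addition, either the global dirty pages exceed the global threshold or strictlimit is set (dirty_exceeded = (bdi_dirty > bdi_thresh) && ((nr_dirty > dirty_thresh) || strictlimit);)
bdp_ratelimits: a per-CPU variable counting the pages dirtied on the current CPU
ratelimit_pages: the per-CPU dirty-page threshold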

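balance_dirty_pages is called when any of the following holds:

1: The current task's dirtied page count reaches ratelimit (if dirty_exceeded is 0, ratelimit is current->nr_dirtied_pause; if dirty_exceeded is 1, it is capped at 32KB);

2: The dirtied page count of the current CPU reaches the threshold ratelimit_pages;

3: The current task's dirtied pages plus the dirtied pages left behind by exited tasks reach the threshold;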

void balance_dirty_pages_ratelimited(struct address_space *mapping)

{

	struct backing_dev_info *bdi = inode_to_bdi(mapping->host);

	int ratelimit;

	int *p;

	if (!bdi_cap_account_dirty(bdi))

		return;

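	ratelimit = current->nr_dirtied_pause;  /* threshold: the initial value of 32 pages means 128KB */

	if (bdi->dirty_exceeded)                /* if set, lower the threshold that triggers balancing to speed up writeback */

		ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));  /* cap the threshold at 32KB (initial value 128KB) to speed up writeback */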

	preempt_disable();

	/*

	 * This prevents one CPU to accumulate too many dirtied pages without

	 * calling into balance_dirty_pages(), which can happen when there are

	 * 1000+ tasks, all of them start dirtying pages at exactly the same

	 * time, hence all honoured too large initial task->nr_dirtied_pause.

	 */

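	/* i.e. balancing is triggered when either the current task or the current CPU reaches its threshold */

	p =  this_cpu_ptr(&bdp_ratelimits);  /* dirtied page count of the current CPU */

	if (unlikely(current->nr_dirtied >= ratelimit))  /* the task has reached its threshold, so the balance call below is certain to run; reset the per-CPU counter */

		*p = 0;

	else if (unlikely(*p >= ratelimit_pages)) {     /* defaults to 32 pages */ /* the task is below its threshold but the CPU has reached ratelimit_pages: set ratelimit to 0 so the balance call is certain to run, and reset the per-CPU counter */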

		*p = 0;

		ratelimit = 0;

	}

	/*

	 * Pick up the dirtied pages by the exited tasks. This avoids lots of

	 * short-lived tasks (eg. gcc invocations in a kernel build) escaping

	 * the dirty throttling and livelock other long-run dirtiers.

	 */

	p = this_cpu_ptr(&dirty_throttle_leaks);   /* pages dirtied by exited tasks are accounted for here */

	if (*p > 0 && current->nr_dirtied < ratelimit) {

		unsigned long nr_pages_dirtied;

		nr_pages_dirtied = min(*p, ratelimit - current->nr_dirtied);

		*p -= nr_pages_dirtied;

		current->nr_dirtied += nr_pages_dirtied;

	}

	preempt_enable();

	if (unlikely(current->nr_dirtied >= ratelimit))    /* the current task has reached its threshold */

		balance_dirty_pages(mapping, current->nr_dirtied);

}

EXPORT_SYMBOL(balance_dirty_pages_ratelimited);

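Normally dirty pages are written back by periodic and background writeback, which costs the current task no time. But when dirty > dirty_freerun_ceiling(thresh, bg_thresh), that is, when the dirty page count exceeds the midpoint of the throttling threshold and the background threshold, the current task has to sleep for a while so that the flusher thread can do its work.

When dirty <= dirty_freerun_ceiling(thresh, bg_thresh), nr_dirtied_pause is still adjusted dynamically so that writeback is paced better. The adjustment policy is: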

static unsigned long dirty_poll_interval(unsigned long dirty,

					 unsigned long thresh)

{

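	/* below thresh: poll roughly every sqrt(thresh - dirty) pages, rounded down to a power of two */

	if (thresh > dirty)

		return 1UL << (ilog2(thresh - dirty) >> 1);

	return 1;  /* dirty has reached thresh: check again after every single dirtied page */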

}

For why it is done this way, see the following explanation:

/*

Ideally if we know there are N dirtiers, it’s safe to let each task

poll at (thresh-dirty)/N without exceeding the dirty limit.

However we neither know the current N, nor is sure whether it will

rush high at next second. So sqrt is used to tolerate larger N on

increased (thresh-dirty) gap:

irb> 0.upto(10) { |i| mb=2**i; pages=mb<<(20-12); printf "%4d\t%4d\n", mb, Math.sqrt(pages)}

1 16

2 22

4 32

8 45

16 64

32 90

64 128

128 181

256 256

512 362

1024 512

The above table means, given 1MB (or 1GB) gap and the dd tasks polling

balance_dirty_pages() on every 16 (or 512) pages, the dirty limit

won’t be exceeded as long as there are less than 16 (or 512) concurrent

dd’s.

Note that dirty_poll_interval() will mainly be used when (dirty < freerun).

When the dirty pages are floating in range [freerun, limit],

“[PATCH 14/18] writeback: control dirty pause time” will independently

adjust tsk->nr_dirtied_pause to get suitable pause time.

So the sqrt naturally leads to less overheads and more N tolerance for

large memory servers, which have large (thresh-freerun) gaps.

*/
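
The table above uses an exact square root, while dirty_poll_interval() rounds down to a power of two. The difference can be seen with a small user-space sketch; `ilog2_ul` is a hypothetical stand-in for the kernel's ilog2(), and `poll_interval` simply repeats the arithmetic of the function shown earlier.

```c
#include <assert.h>

/* floor(log2(n)) for n > 0; user-space stand-in for the kernel's ilog2(). */
static unsigned int ilog2_ul(unsigned long n)
{
	unsigned int log = 0;

	assert(n > 0);
	while (n >>= 1)
		log++;
	return log;
}

/*
 * Same arithmetic as dirty_poll_interval(): halving the exponent gives a
 * power-of-two approximation of sqrt(thresh - dirty).
 */
static unsigned long poll_interval(unsigned long dirty, unsigned long thresh)
{
	if (thresh > dirty)
		return 1UL << (ilog2_ul(thresh - dirty) >> 1);
	return 1;
}
```

Assuming 4KB pages, a 1MB gap is 256 pages and gives an interval of 16, and a 1GB gap (262144 pages) gives 512, matching the table. A 2MB gap (512 pages) also gives 16 rather than 22, because the result is rounded down to a power of two.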

void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)

{

	/* dirtyable memory is not all of system memory, but free pages + reclaimable pages (file pages) */

	const unsigned long available_memory = global_dirtyable_memory();

	unsigned long background;

	unsigned long dirty;

	struct task_struct *tsk;

	if (vm_dirty_bytes)

		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);

	else

		dirty = (vm_dirty_ratio * available_memory) / 100;

	if (dirty_background_bytes)

		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);

	else

		background = (dirty_background_ratio * available_memory) / 100;

	if (background >= dirty)

		background = dirty / 2;

	tsk = current;

	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {   /* for PF_LESS_THROTTLE or real-time tasks, raise both thresholds by 1/4 */

		background += background / 4;

		dirty += dirty / 4;

	}

	*pbackground = background;

	*pdirty = dirty;

	trace_global_dirty_state(background, dirty);

}

static unsigned long global_dirtyable_memory(void)

{

	unsigned long x;

	/* dirtyable memory is not all of system memory, but free pages + file pages */

	x = global_page_state(NR_FREE_PAGES);

	x -= min(x, dirty_balance_reserve);

	x += global_page_state(NR_INACTIVE_FILE);

	x += global_page_state(NR_ACTIVE_FILE);

	if (!vm_highmem_is_dirtyable)

		x -= highmem_dirtyable_memory(x);

	return x + 1;	/* Ensure that we never return 0 */

}

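1: If the dirty page count (reclaimable plus under writeback) is at most the midpoint of the background threshold and the dirty threshold, the loop exits without throttling the task; otherwise background writeback is started (if it is not already running) and the task is throttled;

2: On exit, if the reclaimable dirty page count exceeds the background threshold, background writeback is triggered.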
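
The two decisions can be written as small predicates. This is a user-space sketch, not kernel code: `freerun_ceiling`, `must_throttle` and `should_start_background` are hypothetical names, and the `writeback_running` and `laptop_mode` flags model the two early returns at the end of balance_dirty_pages().

```c
#include <assert.h>
#include <stdbool.h>

/* Midpoint of the two thresholds, i.e. (thresh + bg_thresh) / 2. */
static unsigned long freerun_ceiling(unsigned long thresh,
				     unsigned long bg_thresh)
{
	assert(bg_thresh <= thresh);  /* precondition of this model */
	return (thresh + bg_thresh) / 2;
}

/* Point 1: the task is throttled only above the freerun ceiling. */
static bool must_throttle(unsigned long dirty, unsigned long thresh,
			  unsigned long bg_thresh)
{
	return dirty > freerun_ceiling(thresh, bg_thresh);
}

/*
 * Point 2: on exit, background writeback is started when reclaimable
 * pages exceed bg_thresh, unless writeback already runs or laptop_mode
 * is set.
 */
static bool should_start_background(unsigned long nr_reclaimable,
				    unsigned long bg_thresh,
				    bool writeback_running, bool laptop_mode)
{
	if (writeback_running || laptop_mode)
		return false;
	return nr_reclaimable > bg_thresh;
}
```

With thresh = 200 and bg_thresh = 100, the ceiling is 150 pages: a task seeing 150 dirty pages runs freely, while 151 pages lead to throttling.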

static void balance_dirty_pages(struct address_space *mapping,

				unsigned long pages_dirtied)

{

	unsigned long nr_reclaimable;	/* = file_dirty + unstable_nfs */

	unsigned long nr_dirty;  /* = file_dirty + writeback + unstable_nfs */

	unsigned long background_thresh;

	unsigned long dirty_thresh;

	long period;

	long pause;

	long max_pause;

	long min_pause;

	int nr_dirtied_pause;

	bool dirty_exceeded = false;

	unsigned long task_ratelimit;

	unsigned long dirty_ratelimit;

	unsigned long pos_ratio;

	struct backing_dev_info *bdi = inode_to_bdi(mapping->host);

	bool strictlimit = bdi->capabilities & BDI_CAP_STRICTLIMIT; // throttle against this bdi's own thresholds

	unsigned long start_time = jiffies;

	for (;;) {

		unsigned long now = jiffies;

		unsigned long uninitialized_var(bdi_thresh);

		unsigned long thresh;

		unsigned long uninitialized_var(bdi_dirty);

		unsigned long dirty;

		unsigned long bg_thresh;

		/*

		 * Unstable writes are a feature of certain networked

		 * filesystems (i.e. NFS) in which data may have been

		 * written to the server's write cache, but has not yet

		 * been flushed to permanent storage.

		 */

		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +

					global_page_state(NR_UNSTABLE_NFS);  /* global dirty file pages + unstable NFS pages (= file_dirty + unstable_nfs) */

		nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK); /* all global dirty file pages, including those under writeback (= file_dirty + writeback + unstable_nfs) */

		global_dirty_limits(&background_thresh, &dirty_thresh); // fetch the two thresholds

		if (unlikely(strictlimit)) {  /* per-bdi throttling */

			bdi_dirty_limits(bdi, dirty_thresh, background_thresh,

					 &bdi_dirty, &bdi_thresh, &bg_thresh);

			dirty = bdi_dirty;

			thresh = bdi_thresh;

		} else {                       /* global throttling */

			dirty = nr_dirty;          /* all global dirty file pages, including those under writeback */

			thresh = dirty_thresh;

			bg_thresh = background_thresh;

		}

		/*

		 * Throttle it only when the background writeback cannot

		 * catch-up. This avoids (excessively) small writeouts

		 * when the bdi limits are ramping up in case of !strictlimit.

		 *

		 * In strictlimit case make decision based on the bdi counters

		 * and limits. Small writeouts when the bdi limits are ramping

		 * up are the price we consciously pay for strictlimit-ing.

		 */

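		/* at or below the midpoint of the throttling and background thresholds, the task is not throttled; above it, background writeback cannot keep up, so the task has to pay with its own time */

		if (dirty <= dirty_freerun_ceiling(thresh, bg_thresh)) {  // (thresh + bg_thresh) / 2; no throttling

			current->dirty_paused_when = now;

			current->nr_dirtied = 0;                 /* reset the task's dirtied page count */

			current->nr_dirtied_pause =

				dirty_poll_interval(dirty, thresh);   /* recompute the task's dirty-page threshold */

			break;

		}

		if (unlikely(!writeback_in_progress(bdi)))  /* wake up the flusher thread that does the actual writeback */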

			bdi_start_background_writeback(bdi);

		if (!strictlimit)

			bdi_dirty_limits(bdi, dirty_thresh, background_thresh,

					 &bdi_dirty, &bdi_thresh, NULL);

		//nr_dirty > dirty_thresh

		/*

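		 * With strictlimit (per-bdi throttling), the limit is exceeded as soon as this bdi's dirty pages exceed its threshold;

		 * with global throttling, both this bdi and the whole system must exceed their thresholds.

		 */

		dirty_exceeded = (bdi_dirty > bdi_thresh) &&

				 ((nr_dirty > dirty_thresh) || strictlimit); // threshold exceeded

		if (dirty_exceeded && !bdi->dirty_exceeded)

			bdi->dirty_exceeded = 1;                        // over the limit: later calls use the lower ratelimit to speed up writeback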

		bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,

				     nr_dirty, bdi_thresh, bdi_dirty,

				     start_time);

		dirty_ratelimit = bdi->dirty_ratelimit;

		pos_ratio = bdi_position_ratio(bdi, dirty_thresh,

					       background_thresh, nr_dirty,

					       bdi_thresh, bdi_dirty);

		task_ratelimit = ((u64)dirty_ratelimit * pos_ratio) >>

							RATELIMIT_CALC_SHIFT;

		max_pause = bdi_max_pause(bdi, bdi_dirty);

		min_pause = bdi_min_pause(bdi, max_pause,

					  task_ratelimit, dirty_ratelimit,

					  &nr_dirtied_pause);

		if (unlikely(task_ratelimit == 0)) {

			period = max_pause;

			pause = max_pause;

			goto pause;

		}

		period = HZ * pages_dirtied / task_ratelimit;

		pause = period;

		if (current->dirty_paused_when)

			pause -= now - current->dirty_paused_when;

		/*

		 * For less than 1s think time (ext3/4 may block the dirtier

		 * for up to 800ms from time to time on 1-HDD; so does xfs,

		 * however at much less frequency), try to compensate it in

		 * future periods by updating the virtual time; otherwise just

		 * do a reset, as it may be a light dirtier.

		 */

		if (pause < min_pause) {

			trace_balance_dirty_pages(bdi,

						  dirty_thresh,

						  background_thresh,

						  nr_dirty,

						  bdi_thresh,

						  bdi_dirty,

						  dirty_ratelimit,

						  task_ratelimit,

						  pages_dirtied,

						  period,

						  min(pause, 0L),

						  start_time);

			if (pause < -HZ) {

				current->dirty_paused_when = now;

				current->nr_dirtied = 0;

			} else if (period) {

				current->dirty_paused_when += period;

				current->nr_dirtied = 0;

			} else if (current->nr_dirtied_pause <= pages_dirtied)

				current->nr_dirtied_pause += pages_dirtied;

			break;

		}

		if (unlikely(pause > max_pause)) {

			/* for occasional dropped task_ratelimit */

			now += min(pause - max_pause, max_pause);

			pause = max_pause;

		}

pause:

		trace_balance_dirty_pages(bdi,

					  dirty_thresh,

					  background_thresh,

					  nr_dirty,

					  bdi_thresh,

					  bdi_dirty,

					  dirty_ratelimit,

					  task_ratelimit,

					  pages_dirtied,

					  period,

					  pause,

					  start_time);

		__set_current_state(TASK_KILLABLE);

		io_schedule_timeout(pause); // the task may be scheduled out here, for at most max_pause (about 200ms)

		current->dirty_paused_when = now + pause;

		current->nr_dirtied = 0;

		current->nr_dirtied_pause = nr_dirtied_pause;

		/*

		 * This is typically equal to (nr_dirty < dirty_thresh) and can

		 * also keep "1000+ dd on a slow USB stick" under control.

		 */

		if (task_ratelimit)

			break;

		/*

		 * In the case of an unresponding NFS server and the NFS dirty

		 * pages exceeds dirty_thresh, give the other good bdi's a pipe

		 * to go through, so that tasks on them still remain responsive.

		 *

		 * In theory 1 page is enough to keep the comsumer-producer

		 * pipe going: the flusher cleans 1 page => the task dirties 1

		 * more page. However bdi_dirty has accounting errors.  So use

		 * the larger and more IO friendly bdi_stat_error.

		 */

		if (bdi_dirty <= bdi_stat_error(bdi))

			break;

		if (fatal_signal_pending(current))

			break;

	}

	if (!dirty_exceeded && bdi->dirty_exceeded)  // no longer over the limit: clear the flag

		bdi->dirty_exceeded = 0;

	if (writeback_in_progress(bdi))  // writeback is already running: return

		return;

	/*

	 * In laptop mode, we wait until hitting the higher threshold before

	 * starting background writeout, and then write out all the way down

	 * to the lower threshold.  So slow writers cause minimal disk activity.

	 *

	 * In normal mode, we start background writeout at the lower

	 * background_thresh, to keep the amount of dirty memory low.

	 */

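	/*

	* laptop_mode (power saving): as the comment above says, background writeback is deferred until the higher threshold is reached, so return here without starting it.

	*/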

	if (laptop_mode)

		return;

	if (nr_reclaimable > background_thresh) // reclaimable pages exceed background_thresh: start asynchronous writeback by the flusher thread

		bdi_start_background_writeback(bdi);

}
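
The pause arithmetic in the middle of the loop is easier to follow in isolation. The following user-space sketch makes several assumptions: `HZ` is taken as 1000, `RATELIMIT_CALC_SHIFT` as 10 (so a pos_ratio of 1024 means 1.0), and `task_ratelimit_of` and `pause_ticks` are hypothetical helper names. The min_pause handling and the bookkeeping of dirty_paused_when are left out.

```c
#include <assert.h>

#define HZ 1000                  /* assumption: 1000 ticks per second */
#define RATELIMIT_CALC_SHIFT 10  /* assumption: pos_ratio has 10 fractional bits */

/* task_ratelimit = dirty_ratelimit scaled by the fixed-point pos_ratio. */
static unsigned long task_ratelimit_of(unsigned long dirty_ratelimit,
				       unsigned long pos_ratio)
{
	unsigned long long scaled = (unsigned long long)dirty_ratelimit * pos_ratio;

	return (unsigned long)(scaled >> RATELIMIT_CALC_SHIFT);
}

/*
 * Ticks the task should sleep after dirtying pages_dirtied pages at
 * task_ratelimit pages per second, given that think_time ticks have
 * already passed since the previous pause ended. The result may be
 * negative when the task was slower than its allowed rate.
 */
static long pause_ticks(unsigned long pages_dirtied,
			unsigned long task_ratelimit,
			long think_time, long max_pause)
{
	long pause;

	assert(max_pause > 0);
	if (task_ratelimit == 0)
		return max_pause;  /* no bandwidth granted: sleep the maximum */

	pause = (long)(HZ * pages_dirtied / task_ratelimit) - think_time;
	if (pause > max_pause)
		pause = max_pause;
	return pause;
}
```

For instance, a task allowed 3200 pages per second that has just dirtied 32 pages owes 10 ticks; if it already spent 4 ticks elsewhere since its last pause, only 6 remain. In the kernel code above, a result smaller than min_pause means the task does not sleep at all.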
