What should the reference count of a page in the page cache be when no process is reading or writing it?

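By chance, while looking into how to reduce page cache usage, I walked through the code of invalidate_mapping_pages:

It calls __pagevec_lookup to collect a batch of pages from the radix tree, then tries to release each of them by calling invalidate_inode_page.

I mainly looked at how __pagevec_lookup changes the reference count:

__pagevec_lookup --> __find_get_pages --> page_cache_get_speculative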

static inline int page_cache_get_speculative(struct page *page)
{
    /* Try to take a reference, provided the page is not free */
    VM_BUG_ON(in_interrupt());

#ifdef CONFIG_TINY_RCU
# ifdef CONFIG_PREEMPT_COUNT
    VM_BUG_ON(!in_atomic());
# endif
    /*
     * Preempt must be disabled here - we rely on rcu_read_lock doing
     * this for us.
     *
     * Pagecache won't be truncated from interrupt context, so if we have
     * found a page in the radix tree here, we have pinned its refcount by
     * disabling preempt, and hence no need for the "speculative get" that
     * SMP requires.
     */
    VM_BUG_ON_PAGE(page_count(page) == 0, page);
    page_ref_inc(page);

#else
    if (unlikely(!get_page_unless_zero(page))) {    /* <-- this is the branch taken in our case */
        /*
         * Either the page has been freed, or will be freed.
         * In either case, retry here and the caller should
         * do the right thing (see comments above).
         */
        return 0;
    }
#endif
    VM_BUG_ON_PAGE(PageTail(page), page);

    return 1;
}

static inline int get_page_unless_zero(struct page *page)
{
    return page_ref_add_unless(page, 1, 0);
}

static inline int page_ref_add_unless(struct page *page, int nr, int u)
{
    return atomic_add_unless(&page->_count, nr, u);
}

/**
 * atomic_add_unless - add unless the number is already a given value
 * @v: pointer of type atomic_t
 * @a: the amount to add to v...
 * @u: ...unless v is equal to u.
 *
 * Atomically adds @a to @v, so long as @v was not already @u.
 * Returns non-zero if @v was not @u, and zero otherwise.
 */
static inline int atomic_add_unless(atomic_t *v, int a, int u)
{
    return __atomic_add_unless(v, a, u) != u;
}

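The comment on the last function makes the behaviour clear. Here it is called with a = 1 and u = 0, so the reference count is incremented unless page->_count is already 0. In other words, a page whose page->_count is 1 goes to 2 and the call returns non-zero, while a page whose count has already dropped to 0 (freed, or about to be freed) is left untouched and the call returns 0.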
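The "add unless zero" behaviour can be sketched in user space with C11 atomics. This is only a model under stated assumptions: struct fake_page, add_unless, fake_get_page_unless_zero and fake_put_page are hypothetical names that stand in for the kernel helpers, and the compare-and-swap loop is one common way to build such a primitive, not necessarily how any given architecture implements it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* User-space stand-in for the _count field; this is NOT the kernel's struct page. */
struct fake_page {
    atomic_int count;
};

/* Add nr to *v unless *v currently equals u. Returns true if the add happened. */
bool add_unless(atomic_int *v, int nr, int u)
{
    int old = atomic_load(v);

    while (old != u) {
        /* On failure, old is refreshed with the current value and we retry. */
        if (atomic_compare_exchange_weak(v, &old, old + nr))
            return true;
    }
    return false;
}

/* Model of get_page_unless_zero(): only a live page (count != 0) gains a reference. */
bool fake_get_page_unless_zero(struct fake_page *page)
{
    return add_unless(&page->count, 1, 0);
}

/* Drop one reference; a count of 0 here would mean an unbalanced put. */
void fake_put_page(struct fake_page *page)
{
    int old = atomic_fetch_sub(&page->count, 1);

    assert(old > 0);
}
```

With a count of 1 the get succeeds and leaves 2; with a count of 0 it fails and leaves 0, which is what makes the lookup "speculative".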

Now for the release path:

            if (!trylock_page(page))
                continue;
            WARN_ON(page->index != index);
            ret = invalidate_inode_page(page);
            unlock_page(page);

The main call chain is:

invalidate_inode_page --> invalidate_complete_page --> remove_mapping --> __remove_mapping

/*
 * Same as remove_mapping, but if the page is removed from the mapping, it
 * gets returned with a refcount of 0.
 */
static int __remove_mapping(struct address_space *mapping, struct page *page,
                bool reclaimed)
{
    BUG_ON(!PageLocked(page));
    BUG_ON(mapping != page_mapping(page));

    spin_lock_irq(&mapping->tree_lock);
    /*
     * The non racy check for a busy page.
     *
     * Must be careful with the order of the tests. When someone has
     * a ref to the page, it may be possible that they dirty it then
     * drop the reference. So if PageDirty is tested before page_count
     * here, then the following race may occur:
     *
     * get_user_pages(&page);
     * [user mapping goes away]
     * write_to(page);
     *                !PageDirty(page)    [good]
     * SetPageDirty(page);
     * put_page(page);
     *                !page_count(page)   [good, discard it]
     *
     * [oops, our write_to data is lost]
     *
     * Reversing the order of the tests ensures such a situation cannot
     * escape unnoticed. The smp_rmb is needed to ensure the page->flags
     * load is not satisfied before that of page->_count.
     *
     * Note that if SetPageDirty is always performed via set_page_dirty,
     * and thus under tree_lock, then this ordering is not required.
     */
    if (!page_ref_freeze(page, 2))
        goto cannot_free;
    /* note: atomic_cmpxchg in page_freeze_refs provides the smp_rmb */
    if (unlikely(PageDirty(page))) {
        page_ref_unfreeze(page, 2);
        goto cannot_free;
    }

    if (PageSwapCache(page)) {
        swp_entry_t swap = { .val = page_private(page) };
        __delete_from_swap_cache(page);
        spin_unlock_irq(&mapping->tree_lock);
        swapcache_free(swap, page);
    } else {
        void (*freepage)(struct page *);
        void *shadow = NULL;

        freepage = mapping->a_ops->freepage;
        /*
         * Remember a shadow entry for reclaimed file cache in
         * order to detect refaults, thus thrashing, later on.
         *
         * But don't store shadows in an address space that is
         * already exiting.  This is not just an optimization,
         * inode reclaim needs to empty out the radix tree or
         * the nodes are lost.  Don't plant shadows behind its
         * back.
         *
         * We also don't store shadows for DAX mappings because the
         * only page cache pages found in these are zero pages
         * covering holes, and because we don't want to mix DAX
         * exceptional entries and shadow exceptional entries in the
         * same page_tree.
         */
        if (reclaimed && page_is_file_cache(page) &&
            !mapping_exiting(mapping) && !dax_mapping(mapping))
            shadow = workingset_eviction(mapping, page);
        __delete_from_page_cache(page, shadow);
        spin_unlock_irq(&mapping->tree_lock);
        mem_cgroup_uncharge_cache_page(page);

        if (freepage != NULL)
            freepage(page);
    }

    return 1;

cannot_free:
    spin_unlock_irq(&mapping->tree_lock);
    return 0;
}

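From the comment at the top we can see that remove_mapping is a wrapper around __remove_mapping. If __remove_mapping succeeds in removing the page from the page cache, it returns with the page's reference count frozen at 0.

Let's focus on the most critical part:

    if (!page_ref_freeze(page, 2))
        goto cannot_free;

static inline int page_ref_freeze(struct page *page, int count)
{
    return likely(atomic_cmpxchg(&page->_count, count, 0) == count);
}

atomic_cmpxchg implements compare-and-exchange as a single atomic operation. Atomic means the CPU either does not perform it at all or completes it entirely, with no observable intermediate state; here, the compare and the exchange happen as one step.

There is plenty of material online about the atomic and compare-and-exchange helpers, so I won't repeat it. The gist is: if the page's count is 2, it is swapped to 0 and the old value 2 is returned.

So if page->_count is exactly 2 the page can be freed; otherwise we take the cannot_free branch. For an otherwise idle page, page->_count is 2 after __pagevec_lookup (its resting count of 1 plus the reference taken by the lookup), so the page ends up being freed. If anyone else holds a reference, the count is above 2 and the freeze fails.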
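The freeze/unfreeze protocol can be modelled in user space as below. This is a sketch under assumptions, not kernel code: ref_freeze, ref_unfreeze and try_remove are hypothetical names, the dirty flag is passed in as a plain bool, and C11 atomic_compare_exchange_strong stands in for atomic_cmpxchg.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Model of page_ref_freeze(): succeeds only if the count is exactly `expected`. */
bool ref_freeze(atomic_int *count, int expected)
{
    /* If *count == expected, store 0. `expected` is a local copy, so the
     * write-back that happens on failure is harmless. */
    return atomic_compare_exchange_strong(count, &expected, 0);
}

/* Model of page_ref_unfreeze(): only valid on a frozen (zero) count. */
void ref_unfreeze(atomic_int *count, int value)
{
    assert(atomic_load(count) == 0);
    atomic_store(count, value);
}

/* Model of the refcount side of remove_mapping(). Returns true if removed. */
bool try_remove(atomic_int *count, bool dirty)
{
    if (!ref_freeze(count, 2))
        return false;            /* someone else holds a reference */
    if (dirty) {
        ref_unfreeze(count, 2);  /* put both references back */
        return false;
    }
    ref_unfreeze(count, 1);      /* only the caller's reference remains */
    return true;
}
```

A count of 2 on a clean page ends at 1, a count of 3 is left at 3, and a dirty page with a count of 2 is restored to 2.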

Back to the wrapper function remove_mapping:

/*
 * Attempt to detach a locked page from its ->mapping.  If it is dirty or if
 * someone else has a ref on the page, abort and return 0.  If it was
 * successfully detached, return 1.  Assumes the caller has a single ref on
 * this page.
 */
int remove_mapping(struct address_space *mapping, struct page *page)
{
    if (__remove_mapping(mapping, page, false)) {
        /*
         * Unfreezing the refcount with 1 rather than 2 effectively
         * drops the pagecache ref for us without requiring another
         * atomic operation.
         */
        page_ref_unfreeze(page, 1);
        return 1;
    }
    return 0;
}

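To summarize: the lookup raises a page->_count of 1 to 2; the compare-and-exchange then turns a count of exactly 2 into 0; finally page_ref_unfreeze sets the count to 1 and the removal has succeeded. That last reference belongs to the caller and goes away when invalidate_mapping_pages releases its pagevec, at which point the page is actually freed.

Interested readers can dig a little further. Pages in the page cache are put on the LRU by default. To avoid adding pages to the LRU one at a time too frequently, a pagevec array is used as a buffer: when the array is full, or when a drain function is called explicitly, the pages buffered in the array are flushed onto the LRU list. To tell these two states apart, a page sitting in the LRU pagevec array has its count raised by 1, and when it is really moved onto the LRU list the count is dropped by 1 again, restoring the previous value. So the LRU itself does not hold a reference.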

As we know, when a page is allocated from the freelist its reference count has to go from 0 to 1.

get_page_from_freelist -> buffered_rmqueue -> prep_new_page -> set_page_refcounted; this last function does the +1.

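I wrote this post because I used to believe that being added to the LRU and being added to the radix tree each required an extra reference. Insertion into the radix tree does add 1, but the caller normally drops 1 right afterwards, so the increase is only temporary. The LRU is similar: the only increment I found is when the page is put into the LRU pagevec array, which does add 1, but when the page is really moved onto the LRU list that 1 is dropped again. So a page that is actually on the LRU list keeps its count unchanged, although PG_lru is of course set.

Likewise, when reading add_to_page_cache_lru I saw that it does take a reference on the page being added, so I had always assumed that a page in the page cache has a reference count of at least 2, and therefore 3 after __pagevec_lookup.

So when I saw the following code, I simply could not make sense of it.

    if (!page_ref_freeze(page, 2))
        goto cannot_free;

Then I went back and noticed that every caller of add_to_page_cache_lru calls put_page afterwards, whether the insertion succeeded or failed. On success the page is now in the page cache, and the put_page cancels out the reference taken at insertion, leaving the count at 1. On failure the put_page frees the memory, because the count drops to 0.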
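The bookkeeping described so far can be replayed with a toy model. This is only an illustration under assumptions: struct model_page and the model_* functions are hypothetical, a plain int replaces the atomic counter, and each step mirrors one of the increments or decrements described in the text rather than real kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy page: `count` stands in for page->_count. */
struct model_page {
    int  count;
    bool in_cache;
    bool in_pagevec;   /* parked in a per-CPU LRU pagevec */
    bool on_lru;
};

static void model_put(struct model_page *p)
{
    assert(p->count > 0);
    p->count--;
}

/* Allocate, insert into cache + LRU pagevec, then the caller's put_page. */
void model_insert(struct model_page *p)
{
    *p = (struct model_page){ .count = 1 };  /* set_page_refcounted: 1 */
    p->count++; p->in_cache = true;          /* added to the radix tree: 2 */
    p->count++; p->in_pagevec = true;        /* added to the LRU pagevec: 3 */
    model_put(p);                            /* caller's put_page: 2 */
}

/* Model of draining the pagevec onto the LRU list. */
void model_drain(struct model_page *p)
{
    if (!p->in_pagevec)
        return;
    p->in_pagevec = false;
    p->on_lru = true;
    model_put(p);                            /* pagevec reference dropped: 1 */
}

/* One invalidate attempt: lookup (+1), freeze on exactly 2, release (-1). */
bool model_invalidate(struct model_page *p)
{
    bool removed = false;

    p->count++;                              /* lookup reference */
    if (p->count == 2) {                     /* page_ref_freeze(page, 2) */
        p->count = 1;                        /* unfreeze to 1 */
        p->in_cache = false;
        removed = true;
    }
    model_put(p);                            /* caller releases its pagevec */
    return removed;
}
```

In this model a freshly inserted page rests at 2 and cannot be invalidated; after the drain it rests at 1, the invalidate succeeds, and the count reaches 0.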

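Back to the question at the start: a page that nobody is accessing, that is in the page cache and also on the LRU list, has a reference count of 1, and in my accounting that 1 is the one taken when the page was removed from the freelist. (The comment in remove_mapping accounts for it as the page cache's reference; the arithmetic is the same either way.) The count can also be 2, which means the page is not yet on the LRU list and is only in a pagevec.

That is, a page that nobody is accessing, that is in the page cache but still sits in an LRU pagevec, should have a reference count of 2. This is why fadvise64_64 does the following:

case POSIX_FADV_DONTNEED:
        if (!bdi_write_congested(mapping->backing_dev_info))
            __filemap_fdatawrite_range(mapping, offset, endbyte,
                           WB_SYNC_NONE);

        /* First and last FULL page! */
        start_index = (offset+(PAGE_CACHE_SIZE-1)) >> PAGE_CACHE_SHIFT;
        end_index = (endbyte >> PAGE_CACHE_SHIFT);

        if (end_index >= start_index) {
            unsigned long count = invalidate_mapping_pages(mapping,
                        start_index, end_index);

            /*
             * If fewer pages were invalidated than expected then
             * it is possible that some of the pages were on
             * a per-cpu pagevec for a remote CPU. Drain all
             * pagevecs and try again.
             */
            if (count < (end_index - start_index + 1)) {
                lru_add_drain_all();
                invalidate_mapping_pages(mapping, start_index,
                        end_index);
            }
        }
        break;

When fewer pages were released than requested, lru_add_drain_all is called. Without it, pages still sitting in an LRU pagevec could fail to be released because of their extra reference, even though most of them can in fact be released once drained.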
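From user space this path is reached through posix_fadvise with POSIX_FADV_DONTNEED. The following is a minimal sketch assuming a Linux/POSIX system; drop_file_cache and drop_cache_demo are hypothetical helper names, the temporary file path is arbitrary, and the fsync beforehand is my own addition so that dirty pages do not stand in the way of invalidation.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Ask the kernel to drop the cached pages of fd.
 * Returns 0 on success, otherwise an errno-style error number. */
int drop_file_cache(int fd)
{
    assert(fd >= 0);

    /* Dirty pages cannot be invalidated, so write them back first. */
    if (fsync(fd) != 0)
        return errno;

    /* offset 0 with len 0 means "from the start to the end of the file".
     * posix_fadvise returns the error number directly, not via errno. */
    return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
}

/* Write two pages' worth of data to a temporary file, then drop its cache. */
int drop_cache_demo(void)
{
    char path[] = "/tmp/fadvise_demo_XXXXXX";
    char buf[8192];
    int fd = mkstemp(path);
    int ret;

    if (fd < 0)
        return errno;

    memset(buf, 'x', sizeof(buf));
    if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
        ret = EIO;
    else
        ret = drop_file_cache(fd);

    close(fd);
    unlink(path);
    return ret;
}
```

A return value of 0 only means the advice was accepted; as discussed above, individual pages may still survive if someone else holds a reference to them.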

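The stap (SystemTap) trace recorded at the time was as follows.

Before the call to page_cache_get_speculative the count is 1. This is the resting count of a page in the page cache (leaving aside pages that someone else is using):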

enter 1147=page=0xffffea000e955080,flags=0x6fffff00020068,mapcount=-1,_count=1===
0xffffffff81183381 : __find_get_pages+0x81/0x170 [kernel]
0xffffffff8118ff2e : __pagevec_lookup+0x1e/0x30 [kernel]
0xffffffff81191243 : invalidate_mapping_pages+0x93/0x1f0 [kernel]
0xffffffff811850b4 : SyS_fadvise64_64+0x1a4/0x290 [kernel]
0xffffffff811851ae : SyS_fadvise64+0xe/0x10 [kernel]
0xffffffff81698b09 : system_call_fastpath+0x16/0x1b [kernel]

After the call to page_cache_get_speculative the count is 2:
enter 1163=page=0xffffea000e955080,flags=0x6fffff00020068,mapcount=-1,_count=2===
0xffffffff811833b5 : __find_get_pages+0xb5/0x170 [kernel]
0xffffffff8118ff2e : __pagevec_lookup+0x1e/0x30 [kernel]
0xffffffff81191243 : invalidate_mapping_pages+0x93/0x1f0 [kernel]
0xffffffff811850b4 : SyS_fadvise64_64+0x1a4/0x290 [kernel]
0xffffffff811851ae : SyS_fadvise64+0xe/0x10 [kernel]
0xffffffff81698b09 : system_call_fastpath+0x16/0x1b [kernel]

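A colleague pointed out that __generic_file_splice_read contains a while loop:

	while (spd.nr_pages < nr_pages) {
		/*
		 * Page could be there, find_get_pages_contig() breaks on
		 * the first hole.
		 */
		page = find_get_page(mapping, index);	// look up the specific page that the contiguous lookup missed
		if (!page) {	// still not found, even after readahead
			/*
			 * page didn't exist, allocate one.
			 */
			page = page_cache_alloc_cold(mapping);	// allocate a page
			if (!page)
				break;

			// insert into the radix tree; this also sets page->mapping and so on
			error = add_to_page_cache_lru(page, mapping, index,
						GFP_KERNEL);
			if (unlikely(error)) {
				page_cache_release(page);
				if (error == -EEXIST)
					continue;
				break;
			}
			/*
			 * add_to_page_cache() locks the page, unlock it
			 * to avoid convoluting the logic below even more.
			 */
			unlock_page(page);
		}

		spd.pages[spd.nr_pages++] = page;	// add the found or newly allocated page to spd
		index++;
	}

A page that goes through page_cache_alloc_cold and then add_to_page_cache_lru is not followed by a -1 here. Doesn't that contradict the description above? There is no decrement because the count is supposed to be 2 at this point: the page is being placed into spd, so its count has to be raised. Since a -1 would have to be followed by a +1, the code simply leaves the count alone. Pages that were put into spd earlier through find_get_pages_contig should also have a count of 2 at this point (leaving aside concurrent users; with them it would be greater than 2).

Whether a page enters spd through find_get_pages_contig or through spd.pages[spd.nr_pages++] = page, every page in spd is guaranteed a count of at least 2. When spd is released, each page gets its -1 and the count is restored.

So, a page that is in the page cache and on the LRU list, with no read or write in progress and with kswapd not currently aging it, has a reference count of exactly 1.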
