遍历数据库中的所有记录时,我们首先想到的是Model.all.each。但是,当数据量很大的时候(数万?),这就不怎么合适了,因为Model.all.each会一次性加载所有记录,并将其实例化成 Model 对象,这显然会增加内存负担,甚至耗尽内存。

对于ActiveRecord 而言,有个find_each专门解决此类问题。find_each底层依赖于find_in_batches,会分批加载记录,默认每批为1000。

Mongoid而言,话说可以直接用Person.all.each,它会自动利用游标(cursor)帮你分批加载的。不过有个问题得留意一下:cursor有个10分钟超时限制。这就意味着遍历时长超过10分钟就危险了,很可能在中途遭遇no cursor错误。

# gems/mongo-2.2.4/lib/mongo/collection.rb:218
# @option options [ true, false ] :no_cursor_timeout The server normally times out idle cursors
# after an inactivity period (10 minutes) to prevent excess memory use. Set this option to prevent that.

虽然可以这样来绕过它 Model.all.no_timeout.each,不过不建议这样做。另外默认的batch_size不一定合适你,可以这样指定Model.all.no_timeout.batch_size(500).each。不过 mongodb 的默认batch_size看起来比较复杂,得谨慎。(The MongoDB server returns the query results in batches. Batch size will not exceed the maximum BSON document size. For most queries, the first batch returns 101 documents or just enough documents to exceed 1 megabyte. Subsequent batch size is 4 megabytes. To override the default size of the batch, see batchSize() and limit().

For queries that include a sort operation without an index, the server must load all the documents in memory to perform the sort before returning any results.

As you iterate through the cursor and reach the end of the returned batch, if there are more results, cursor.next() will perform a getmore operation to retrieve the next batch. )

Model.all.each { print '.' } 将得到类似的查询:

类似的方案应是利用skiplimit,类似于这样Model.all.skip(m).limit(n)。很不幸的是,数据量过大时,这并不好使,因为随着 skip 值变大,会越来越慢(The cursor.skip() method is often expensive because it requires the server to walk from the beginning of the collection or index to get the offset or skip position before beginning to return results. As the offset (e.g. pageNumber above) increases, cursor.skip() will become slower and more CPU intensive. With larger collections, cursor.skip() may become IO bound.)。这让我想起了曾经看过的帖子will_paginate 分页过多 (大概 10000 页),点击最后几页的时候,速度明显变慢,大致原因就是分页底层用到了offset,也是因为offset 越大查询就会越慢。

让我们再次回到ActiveRecordfind_each上来。Rails 考虑得很周全,它底层没利用offset,而是将每批查询的最后一条记录的id作为下批查询的primary_key_offset

# gems/activerecord-4.2.5.1/lib/active_record/relation/batches.rb:98
def find_in_batches(options = {})
options.assert_valid_keys(:start, :batch_size) relation = self
start = options[:start]
batch_size = options[:batch_size] || 1000 unless block_given?
return to_enum(:find_in_batches, options) do
total = start ? where(table[primary_key].gteq(start)).size : size
(total - 1).div(batch_size) + 1
end
end if logger && (arel.orders.present? || arel.taken.present?)
logger.warn("Scoped order and limit are ignored, it's forced to be batch order and batch size")
end relation = relation.reorder(batch_order).limit(batch_size)
records = start ? relation.where(table[primary_key].gteq(start)).to_a : relation.to_a while records.any?
records_size = records.size
primary_key_offset = records.last.id
raise "Primary key not included in the custom select clause" unless primary_key_offset yield records break if records_size < batch_size records = relation.where(table[primary_key].gt(primary_key_offset)).to_a
end
end

前面的Model.all.no_timeout.batch_size(1000).each是 server 端的批量查询,我们也可模仿出client 端的批量查询,即Mongoid版的find_each:

# /config/initializers/mongoid_batches.rb
module Mongoid
module Batches
def find_each(batch_size = 1000)
return to_enum(:find_each, batch_size) unless block_given? find_in_batches(batch_size) do |documents|
documents.each { |document| yield document }
end
end def find_in_batches(batch_size = 1000)
return to_enum(:find_in_batches, batch_size) unless block_given? documents = self.limit(batch_size).asc(:id).to_a
while documents.any?
documents_size = documents.size
primary_key_offset = documents.last.id yield documents break if documents_size < batch_size
documents = self.gt(id: primary_key_offset).limit(batch_size).asc(:id).to_a
end
end
end
end Mongoid::Criteria.include Mongoid::Batches

最后对于耗时操作,还可考虑引入并行计算,类似于这样:

Model.find_each { ... }

Model.find_in_batches do |items|
Parallel.each items, in_processes: 4 do |item|
# ...
end
end

参考链接

  1. https://docs.mongodb.com/manual/core/cursors/
  2. https://docs.mongodb.com/master/reference/method/cursor.skip/
  3. https://ruby-china.org/topics/28659

Mongoid Paging and Iterating Over Large Collections的更多相关文章

  1. Learning part-based templates from large collections of 3D shapse CorrsTmplt Kim 代码调试

    平台: VMware上装的Ubuntu-15.10 环境准备工作:装Fortran, lapack, blas, cblas (理论上装好lapack后面两个应该是自动的),其他的有需要的随时安装就可 ...

  2. Java性能提示(全)

    http://www.onjava.com/pub/a/onjava/2001/05/30/optimization.htmlComparing the performance of LinkedLi ...

  3. EF 5 最佳实践白皮书

    Performance Considerations for Entity Framework 5 By David Obando, Eric Dettinger and others Publish ...

  4. Java 8 Stream API Example Tutorial

    Stream API Overview Before we look into Java 8 Stream API Examples, let’s see why it was required. S ...

  5. Unity 5 Game Optimization (Chris Dickinson 著)

    1. Detecting Performance Issues 2. Scripting Strategies 3. The Benefits of Batching 4. Kickstart You ...

  6. 【转】最佳Restful API 实践

    原文转自:https://bourgeois.me/rest/ REST APIs are a very common topic nowaday; they are part of almost e ...

  7. 【翻译十九】-java之执行器

    Executors In all of the previous examples, there's a close connection between the task being done by ...

  8. 【翻译十七】java-并发之高性能对象

    High Level Concurrency Objects So far, this lesson has focused on the low-level APIs that have been ...

  9. Understanding, Operating and Monitoring Apache Kafka

    Apache Kafka is an attractive service because it's conceptually simple and powerful. It's easy to un ...

随机推荐

  1. iis 7 操作 .net

    下面说一下.NET对IIS7操作.IIS7的操作和IIS5/6(using system.DirectoryServices;使用类DirectoryEntry )有很大的不同,在IIS7里增加了 M ...

  2. 使用selenium启动火狐浏览器,解决Unable to create new remote session问题

    今天用火狐浏览器来做自动化,才启动就报错,提示不能创建新的session,不能启动浏览器 问题原因: 火狐driver与火狐浏览器与selenium版本的不兼容 我使用的火狐driver是0.21.0 ...

  3. SPOJ - ORDERS--- Ordering the Soldiers---根据逆序对求原数组

    题目链接: https://vjudge.net/problem/SPOJ-ORDERS 题目大意: 根据每个数字的逆序对求出原数组 解题思路: 举个例子: n = 5 a[ n ] = { 0, 1 ...

  4. NYOJ 士兵杀敌(1~5)

    士兵杀敌(1): http://acm.nyist.net/JudgeOnline/problem.php?pid=108 分析:前缀和 #include <bits/stdc++.h> ...

  5. linux awk 内置函数详细介绍(实例)

    这节详细介绍awk内置函数,主要分以下3种类似:算数函数.字符串函数.其它一般函数.时间函数 一.算术函数: 以下算术函数执行与 C 语言中名称相同的子例程相同的操作: 函数名 说明 atan2( y ...

  6. 前端面试题(来自前端网http://www.qdfuns.com/notes/23515/c9163ddd620baac5dd23141d41982bb8.html)

    HTML&CSS 1. 常用那几种浏览器测试?有哪些内核(Layout Engine)? (Q1)浏览器:IE,Chrome,FireFox,Safari,Opera. (Q2)内核:Trid ...

  7. Filter,一种aop编程思想的体现

    一.filter简介 filter是Servlet规范里的一个高级特性,只用于对request.response的进行修改. filter提出了FilterChain的概念,客户端请求request在 ...

  8. angular2中一种换肤实现方案

    思路:整体思路是准备多套不同主题的css样式.在anguar项目启动时,首先加载的index.html中先引入一套默认的样式.当我们页面有动作时再切换css.  可以通过url传参触发,也可以通过bu ...

  9. Ubuntu安装MySQL及使用Xshell连接MySQL出现的问题(2003-Can't connect to MySql server及1045错误)

    不管在什么地方,什么时候,学习是快速提升自己的能力的一种体现!!!!!!!!!!! 以下所有的命令都是在root用户下操作(如果还没有设置root密码)如下: 安装好Ubuntu系统之后,打开终端先设 ...

  10. Java - 得到项目中properties属性文件中定义的属性值

    public static String getPropertiesValue(String fileName, String key) {   return ResourceBundle.getBu ...