在讨论怎么去重,提出用 direct buffer 建 btree,想到应该有现成方案,于是找到一个好东西:

MapDB - MapDB : http://www.mapdb.org/

以下来自:kotek.net : http://kotek.net/blog/3G_map

3 billion items in Java Map with 16 GB RAM

One rainy evening I meditated about memory managment in Java and how effectively Java collections utilise memory. I made simple experiment, how much entries can I insert into Java Map with 16 GB of RAM?

Goal of this experiment is to investigate internal overhead of collections. So I decided to use small keys and small values. All tests were made on Linux 64bit Kubuntu 12.04. JVM was 64bit Oracle Java 1.7.0_09-b05 with HotSpot 23.5-b02. There is option to use compressed pointers (-XX:+UseCompressedOops), which is on by default on this JVM.

First is naive test with java.util.TreeMap. It inserts number into map, until it runs out of memory and ends with exception. JVM settings for this test was -Xmx15G

import java.util.*;
Map m = new TreeMap();
for(long counter=0;;counter++){
m.put(counter,"");
if(counter%1000000==0) System.out.println(""+counter);
}

This example ended at 172 milion entries. Near the end insertion rate slowed down thanks to excesive GC activity. On second run I replaced TreeMap with `HashMap, it ended at 182 milions.

Java default collections are not most memory efficient option. So lets try an memory-optimized . I choosed LongHashMap from MapDB, which uses primitive long keys and is optimized to have small memory footprint. JVM settings is again -Xmx15G

import org.mapdb.*
LongMap m = new LongHashMap();
for(long counter=0;;counter++){
m.put(counter,"");
if(counter%1000000==0) System.out.println(""+counter);
}

This time counter stopped at 276 million entries. Again near the end insertion rate slowed down thanks to excesive GC activity.
It looks like this is limit for heap-based collections, Garbage Collection simply brings overhead.

Now is time to pull out the big gun :-). We can always go of-heap where GC can not see our data. Let me introduce you to MapDB, it provides concurrent TreeMap and HashMap backed by database engine. It supports various storage modes, one of them is off-heap memory. (disclaimer: I am MapDB author).

So lets run previous example, but now with off-heap Map. First are few lines to configure and open database, it opens direct-memory store with transactions disabled. Next line creates new Map within the db.

import org.mapdb.*

DB db = DBMaker
.newDirectMemoryDB()
.transactionDisable()
.make(); Map m = db.getTreeMap("test");
for(long counter=0;;counter++){
m.put(counter,"");
if(counter%1000000==0) System.out.println(""+counter);
}

This is off-heap Map, so we need different JVM settings: -XX:MaxDirectMemorySize=15G -Xmx128M. This test runs out of memory at 980 million records.

But MapDB can do better. Problem in previous sample is record fragmentation, b-tree node changes its size on each insert. Workaround is to hold b-tree nodes in cache for short moment before they are inserted. This reduces the record fragmentation to minimum. So lets change DB configuration:

DB db = DBMaker
.newDirectMemoryDB()
.transactionDisable()
.asyncFlushDelay(100)
.make(); Map m = db.getTreeMap("test");

This records runs out of memory with 1 738 million records. Speed is just amazing 1.7 bilion items are inserted within 31 minutes.

MapDB can do even better. Lets increase b-tree node size from 32 to 120 entries and enable transparent compression:

   DB db = DBMaker
.newDirectMemoryDB()
.transactionDisable()
.asyncFlushDelay(100)
.compressionEnable()
.make(); Map m = db.createTreeMap("test",120, false, null, null, null);

This example runs out of memory at whipping 3 315 million records. It is slower thanks to compression, but it still finishes within a few hours. I could probably make some optimization (custom serializers etc) and push number of entries to somewhere around 4 billions.

Maybe you wander how all those entries can fit there. Answer is delta-key compression. Also inserting incremental key (already ordered) into B-Tree is best-case scenario and MapDB is slightly optimized for it. Worst case scenario is inserting keys at random order:

UPDATE added latter: there was bit confusion about compression. Delta-key compression is active by default on all examples. In this example I activated aditional zlib style compression.

    DB db = DBMaker
.newDirectMemoryDB()
.transactionDisable()
.asyncFlushDelay(100)
.make(); Map m = db.getTreeMap("test"); Random r = new Random();
for(long counter=0;;counter++){
m.put(r.nextLong(),"");
if(counter%1000000==0) System.out.println(""+counter);
}

But even with random order MapDB handles to store 651 million records, nearly 4 times more then heap-based collections.

This little excersice does not have much purpose. It is just one of many I do to optimize MapDB. Perhaps most amazing is that insertion speed was actually wery good and MapDB can compete with memory based collections.

用 16G 内存存放 30亿数据(Java Map)转载的更多相关文章

  1. 大数据计算:如何仅用1.5KB内存为十亿对象计数

    大数据计算:如何仅用1.5KB内存为十亿对象计数  Big Data Counting: How To Count A Billion Distinct Objects Using Only 1.5K ...

  2. Java内存区域-- 运行时数据区域

    jvm在执行Java程序时,会把它所管理的内存划分为若干个不同的数据区.这些区域都有各自的用途,以及创建和销毁的时间. 有的区域随着虚拟机进程的启动而存在,有些区域则依赖用户线程的启动和结束而建立和销 ...

  3. Java使用极小的内存完成对超大数据的去重计数,用于实时计算中统计UV

    Java使用极小的内存完成对超大数据的去重计数,用于实时计算中统计UV – lxw的大数据田地 http://lxw1234.com/archives/2015/09/516.htm Java使用极小 ...

  4. java内存结构(执行时数据区域)

    java虚拟机规范规定的java虚拟机内存事实上就是java虚拟机执行时数据区,其架构例如以下: 当中方法区和堆是由全部线程共享的数据区. Java虚拟机栈.本地方法栈和程序计数器是线程隔离的数据区. ...

  5. java内存区域----运行时数据区

    Java虚拟机的内存区域也叫做java运行时数据区,共分为五个部分:程序计数器,方法区,本地方法栈,虚拟机栈和堆.方法区和堆是线程之间所共有的,程序计数器,本地方法栈,虚拟机栈是线程私有的.其中虚拟机 ...

  6. 给定a、b两个文件,各存放50亿个url,每个url各占用64字节,内存限制是4G,如何找出a、b文件共同的url?

    给定a.b两个文件,各存放50亿个url,每个url各占用64字节,内存限制是4G,如何找出a.b文件共同的url? 可以估计每个文件的大小为5G*64=300G,远大于4G.所以不可能将其完全加载到 ...

  7. 从SQL Server到MySQL,近百亿数据量迁移实战

    从SQL Server到MySQL,近百亿数据量迁移实战 狄敬超(3D) 2018-05-29 10:52:48 212 沪江成立于 2001 年,作为较早期的教育学习网站,当时技术选型范围并不大:J ...

  8. Redis基本使用及百亿数据量中的使用技巧分享(附视频地址及观看指南)

    作者:依乐祝 原文地址:https://www.cnblogs.com/yilezhu/p/9941208.html 主讲人:大石头 时间:2018-11-10 晚上20:00 地点:钉钉群(组织代码 ...

  9. Java内存管理-你真的理解Java中的数据类型吗(十)

    勿在流沙筑高台,出来混迟早要还的. 做一个积极的人 编码.改bug.提升自己 我有一个乐园,面向编程,春暖花开! 作为Java程序员,Java 的数据类型这个是一定要知道的! 但是不管是那种数据类型最 ...

  10. JVM 内存区域 (运行时数据区域)

    JVM 内存区域 (运行时数据区域) 链接:https://www.jianshu.com/p/ec479baf4d06 运行时数据区域 Java 虚拟机在执行 Java 程序的过程中会把它所管理的内 ...

随机推荐

  1. LeetCode 1388. Pizza With 3n Slices(3n 块披萨)(DP)

    给你一个披萨,它由 3n 块不同大小的部分组成,现在你和你的朋友们需要按照如下规则来分披萨: 你挑选 任意 一块披萨.Alice 将会挑选你所选择的披萨逆时针方向的下一块披萨.Bob 将会挑选你所选择 ...

  2. 高通ADSP USB流程

    在高通平台上,ADSP(Audio Digital Signal Processor,音频数字信号处理器)可以通过 USB 接口与主机进行数据传输,以下是大致的 ADSP USB 流程: 主机发起 U ...

  3. 数据库运维实操优质文章分享(含Oracle、MySQL等) | 2024年7月刊

    本文为大家整理了墨天轮数据社区2024年7月发布的优质技术文章/文档,主题涵盖Oracle.MySQL.PostgreSQL等主流数据库系统以及国产数据库的深度教程和实用指南.从基础的安装配置到复杂的 ...

  4. 小程序的三大API

    小程序的API有宿主环境提供的 : ps:浏览器的定义对象是 window 而微信中的顶级对象是wx :都是不用声明就能调用 : 1. 事件监听 以on开头,监听事件的触发 eg:onWindowRe ...

  5. 06 导师不敢和你说的水论文隐藏技巧,顶刊、顶会、水刊的论文读哪个,如何做一个称职的学术裁缝.md

    博客配套视频链接: https://www.bilibili.com/video/BV11g41127Zn/?spm_id_from=333.788&vd_source=b1ce52b6eb3 ...

  6. linux命令杂记

    chmod 777 lixiangj 修改目录为共享权限cd .. 跳转上一级目录cd - 跳转上一次跳转的目录ll 查看目录下所有文件ctrl+L 清除屏幕内容| head -10 只看结果中的前1 ...

  7. 项目运行时,tomcat服务器端口被占用

    1.查看tomcat配置文件: 2.查看项目控制台的打印信息: 3.dos命令行解决端口占用 (1)dos命令模式下输入: netstat -ano (进入dos命令:Win + R ,输入cmd ) ...

  8. 推荐一款专为Nginx设计的图形化管理工具: Nginx UI!

    Nginx UI是一款专为Nginx设计的图形化管理工具,旨在简化Nginx的配置与管理过程,提高开发者和系统管理员的工作效率. 项目地址:https://github.com/0xJacky/ngi ...

  9. k8s 中的 Gateway API 的背景和简介【k8s 系列之四】

    〇.Gateway API 的背景 第一阶段:Service 初始的 Kubernetes 内部服务向外暴露,使用的是自身的 LoadBlancer 和 NodePort 类型的 Service. 在 ...

  10. OpenCv Mat 数据结构

    前言 OpenCv的Mat数据结构可以存储图片信息.但是以坐标系构建来说,Mat是以左上角为原点,而我们自己的日常习惯是以左下角为原点. 本文提供了这两者之间的一种转换. 假设 Mat : (x,y) ...