Bulk-Loading Large Volumes of Data into ElasticSearch

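I recently started working on a task that involves large volumes of data.

The current situation: a data-collection program is responsible for storing and retrieving large volumes of data, and statistics over that data may be needed later on.

Data distribution

13 collection points periodically write their results to 4 files (a new set of small files is generated every 5 minutes).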

Name                                                 Size (bytes)

gather_1_2014-02-27-14-50-0.txt                      568497

gather_1_2014-02-27-14-50-1.txt                      568665

gather_1_2014-02-27-14-50-2.txt                      568172

gather_1_2014-02-27-14-50-3.txt                      568275

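In step with this, a shell script loads the four files into a single table, tab_tmp_2014_2_27, in the sybase_iq database.

The daily volume is about 300 million records, so the small files add up to roughly 3 GB. With so many small files and such a large single table, a query on the composite primary key went from a latency of about 2 s to 5~10 minutes.

Given this, the current storage structure needs to be optimized.

I first looked at a related system, catior, which uses a round-robin database. Its advantage is that the stored data makes it easy to generate MRTG graphs; its drawback is that it is poorly suited to statistics. Elasticsearch was therefore brought in to optimize retrieval over the large data set.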

Test platform

CPU: AMD Opteron(tm) Processor 6136, 64-bit, 2.4 GHz × 32

Memory: 64 GB

Disk: 1.5 TB

OS: Red Hat Enterprise Linux Server release 6.4 (Santiago)

Directory layout of the input files:

[test@test001 data]$ ls

0  1  2  3

Simple test code:

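import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;

import org.apache.commons.io.FileUtils;
import org.apache.commons.io.LineIterator;
import org.apache.log4j.Logger;

public class FileReader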

{

	private File file;

	private String splitCharactor;

	private Map<String, Class<?>> colNames;

	private static final Logger LOG = Logger.getLogger(FileReader.class);

	/**
	 * @param file
	 *            the file to read
	 * @param splitCharactor
	 *            the delimiter (a regular expression) used to split each line
	 * @param colNames
	 *            column names mapped to their types, in column order
	 */

	public FileReader(File file, String splitCharactor, Map<String, Class<?>> colNames)

	{

		this.file = file;

		this.splitCharactor = splitCharactor;

		this.colNames = colNames;

	}

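	/**
	 * Reads the file and parses every line into a map keyed by column name.
	 *
	 * @return one map per well-formed line; lines whose field count does not
	 *         match colNames are skipped
	 * @throws Exception
	 *             if the file does not exist or a value cannot be converted
	 */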

	public List<Map<String, Object>> readFile() throws Exception

	{

		List<Map<String, Object>> list = new ArrayList<Map<String, Object>>();

		if (!file.isFile())

		{

			throw new Exception("File does not exist: " + file.getName());

		}

		LineIterator lineIterator = null;

		try

		{

			lineIterator = FileUtils.lineIterator(file, "UTF-8");

			while (lineIterator.hasNext())

			{

				String line = lineIterator.next();

				String[] values = line.split(splitCharactor);

				if (colNames.size() != values.length)

				{

					continue;

				}

				Map<String, Object> map = new HashMap<String, Object>();

				Iterator<Entry<String, Class<?>>> iterator = colNames.entrySet()

						.iterator();

				int count = 0;

				while (iterator.hasNext())

				{

					Entry<String, Class<?>> entry = iterator.next();

					Object value = values[count];

					if (!String.class.equals(entry.getValue()))

					{

						value = entry.getValue().getMethod("valueOf", String.class)

								.invoke(null, value);

					}

					map.put(entry.getKey(), value);

					count++;

				}

				list.add(map);

			}

		}

		catch (IOException e)

		{

			LOG.error("File reading line error." + e.toString(), e);

		}

		finally

		{

			LineIterator.closeQuietly(lineIterator);

		}

		return list;

	}

}
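The type conversion in readFile() relies on every non-String column type exposing a static valueOf(String) method. The following is a minimal, self-contained sketch of that technique, without the Commons IO and log4j dependencies. The class name ValueOfDemo, the helper names, and the sample line are hypothetical and only serve as illustration; the real file layout is not shown in this post.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ValueOfDemo {

    // Converts raw text to the target type by calling its static valueOf(String).
    // Strings are returned unchanged, mirroring the check in FileReader.readFile().
    static Object convert(String raw, Class<?> type) throws Exception {
        if (String.class.equals(type)) {
            return raw;
        }
        return type.getMethod("valueOf", String.class).invoke(null, raw);
    }

    // Splits one line and converts each field according to the column map.
    // Returns null when the field count does not match the column count.
    static Map<String, Object> parseLine(String line, String splitRegex,
            Map<String, Class<?>> columns) throws Exception {
        String[] values = line.split(splitRegex);
        if (values.length != columns.size()) {
            return null;
        }
        Map<String, Object> row = new LinkedHashMap<>();
        int i = 0;
        for (Map.Entry<String, Class<?>> entry : columns.entrySet()) {
            row.put(entry.getKey(), convert(values[i++], entry.getValue()));
        }
        return row;
    }

    public static void main(String[] args) throws Exception {
        // Insertion order matters: it must match the column order in the file.
        Map<String, Class<?>> columns = new LinkedHashMap<>();
        columns.put("aa", Long.class);
        columns.put("bb", String.class);
        columns.put("dd", Integer.class);

        // Hypothetical sample line using the "$" delimiter from the test code.
        System.out.println(parseLine("1393483800$10.0.0.1$7", "\\$", columns));
    }
}
```

Because the call goes through reflection, a malformed number surfaces as an InvocationTargetException wrapping the NumberFormatException, which is why readFile() declares a broad throws Exception.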

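import java.io.File;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

// JSON.toJSONString is assumed to come from fastjson.
import com.alibaba.fastjson.JSON;

public class StreamIntoEs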

{

	public static class ChildThread extends Thread

	{

		int number;

		public ChildThread(int number)

		{

			this.number = number;

		}

		@Override

		public void run()

		{

			Settings settings = ImmutableSettings.settingsBuilder()

					.put("client.transport.sniff", true)

					.put("client.transport.ping_timeout", 100)

					.put("cluster.name", "elasticsearch").build();

			TransportClient client = new TransportClient(settings)

					.addTransportAddress(new InetSocketTransportAddress("192.168.32.228",

							9300));

			File dir = new File("/export/home/es/data/" + number);

			LinkedHashMap<String, Class<?>> colNames = new LinkedHashMap<String, Class<?>>();

			colNames.put("aa", Long.class);

			colNames.put("bb", String.class);

			colNames.put("cc", String.class);

			colNames.put("dd", Integer.class);

			colNames.put("ee", Long.class);

			colNames.put("ff", Long.class);

			colNames.put("hh", Long.class);

			int count = 0;

			long startTime = System.currentTimeMillis();

			for (File file : dir.listFiles())

			{

				int currentCount = 0;

				long startCurrentTime = System.currentTimeMillis();

				FileReader reader = new FileReader(file, "\\$", colNames);

				BulkResponse resp = null;

				BulkRequestBuilder bulkRequest = client.prepareBulk();

				try

				{

					List<Map<String, Object>> results = reader.readFile();

					for (Map<String, Object> col : results)

					{

						bulkRequest.add(client.prepareIndex("flux", "fluxdata")

								.setSource(JSON.toJSONString(col)).setId(col.get("getway")+"##"+col.get("port_info")+"##"+col.get("device_id")+"##"+col.get("collecttime")));

						count++;

						currentCount++;

					}

					resp = bulkRequest.execute().actionGet();

				}

				catch (Exception e)

				{

					// TODO Auto-generated catch block

					e.printStackTrace();

				}

				long endCurrentTime = System.currentTimeMillis();

				System.out.println("[thread-" + number + "-]per count:" + currentCount);

				System.out.println("[thread-" + number + "-]per time:"

						+ (endCurrentTime - startCurrentTime));

				System.out.println("[thread-" + number + "-]per count/s:"

						+ (float) currentCount / (endCurrentTime - startCurrentTime)

						* 1000);

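				// resp stays null when the bulk request threw, so guard against it.

				System.out.println("[thread-" + number + "-]bulk response:"

						+ (resp == null ? "none (bulk request failed)" : "hasFailures=" + resp.hasFailures()));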

			}

			long endTime = System.currentTimeMillis();

			System.out.println("[thread-" + number + "-]total count:" + count);

			System.out.println("[thread-" + number + "-]total time:"

					+ (endTime - startTime));

			System.out.println("[thread-" + number + "-]total count/s:" + (float) count

					/ (endTime - startTime) * 1000);

			// Release the transport client's connections and threads.

			client.close();

		}

	}

	public static void main(String args[])

	{

		for (int i = 0; i < 4; i++)

		{

			ChildThread childThread = new ChildThread(i);

			childThread.start();

		}

	}

}

Four threads are started to do the loading, and one bulk request is sent for each file once it has been parsed.
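Sending one bulk request per file ties the request size to the file size. As a possible variation, not what the test code above does, the parsed rows could be split into fixed-size batches and each batch sent as its own bulk request. The sketch below shows only the splitting step in plain Java; the class name BatchSplitter and the batch size are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class BatchSplitter {

    // Splits items into consecutive batches of at most batchSize elements.
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            // Copy the view so each batch is independent of the source list.
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 12; i++) {
            rows.add(i);
        }
        // With a batch size of 5, twelve rows become batches of 5, 5 and 2.
        for (List<Integer> batch : partition(rows, 5)) {
            System.out.println("batch of " + batch.size() + ": " + batch);
        }
    }
}
```

In the loading loop, each batch would get its own prepareBulk() request, so a failure affects one batch rather than a whole file.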

Initialization script:

curl -XDELETE 'http://192.168.32.228:9200/twitter/'

curl -XPUT 'http://192.168.32.228:9200/twitter/' -d '
{
     "index" : {
          "number_of_shards" : 5,
          "number_of_replicas" : 0,
          "index.refresh_interval" : "-1",
          "index.translog.flush_threshold_ops" : "100000"
     }
}'

curl -XPUT 'http://192.168.32.228:9200/twitter/twiterdata/_mapping' -d '
{
     "twiterdata" : {
          "properties" : {
               "aa" : {"type" : "long", "index" : "not_analyzed"},
               "bb" : {"type" : "string", "index" : "not_analyzed"},
               "cc" : {"type" : "string", "index" : "not_analyzed"},
               "dd" : {"type" : "integer", "index" : "not_analyzed"},
               "ee" : {"type" : "long", "index" : "no"},
               "ff" : {"type" : "long", "index" : "no"},
               "gg" : {"type" : "long", "index" : "no"},
               "hh" : {"type" : "long", "index" : "no"},
               "ii" : {"type" : "long", "index" : "no"},
               "jj" : {"type" : "long", "index" : "no"},
               "kk" : {"type" : "long", "index" : "no"}
          }
     }
}'

Performance reference (total time is in milliseconds):

Without refresh_interval set

[test@test001 bin]$ more StreamIntoEs.out|grep total

[thread-2-]total count:1199411

[thread-2-]total time:1223718

[thread-2-]total count/s:980.1368

[thread-1-]total count:1447214

[thread-1-]total time:1393528

[thread-1-]total count/s:1038.5253

[thread-0-]total count:1508043

[thread-0-]total time:1430167

[thread-0-]total count/s:1054.4524

[thread-3-]total count:1650576

[thread-3-]total time:1471103

[thread-3-]total count/s:1121.9989

Sum of the four per-thread rates: 4195.1134 records/s

With refresh_interval set ("-1", as in the initialization script)

[test@test001 bin]$ more StreamIntoEs.out |grep total

[thread-2-]total count:1199411

[thread-2-]total time:996111

[thread-2-]total count/s:1204.0938

[thread-1-]total count:1447214

[thread-1-]total time:1163207

[thread-1-]total count/s:1244.1586

[thread-0-]total count:1508043

[thread-0-]total time:1202682

[thread-0-]total count/s:1253.9

[thread-3-]total count:1650576

[thread-3-]total time:1236239

[thread-3-]total count/s:1335.1593

Sum of the four per-thread rates: 5037.3117 records/s

With refresh_interval set, plus field type conversion

[test@test001 bin]$ more StreamIntoEs.out |grep total

[thread-2-]total count:1199411

[thread-2-]total time:1065229

[thread-2-]total count/s:1125.9653

[thread-1-]total count:1447214

[thread-1-]total time:1218342

[thread-1-]total count/s:1187.8552

[thread-0-]total count:1508043

[thread-0-]total time:1230474

[thread-0-]total count/s:1225.5789

[thread-3-]total count:1650576

[thread-3-]total time:1274027

[thread-3-]total count/s:1295.5581

Sum of the four per-thread rates: 4834.9575 records/s

With refresh_interval set, field type conversion, and an explicit document ID

[thread-2-]total count:1199411

[thread-2-]total time:912251

[thread-2-]total count/s:1314.7817

[thread-1-]total count:1447214

[thread-1-]total time:1067117

[thread-1-]total count/s:1356.1906

[thread-0-]total count:1508043

[thread-0-]total time:1090577

[thread-0-]total count/s:1382.7937

[thread-3-]total count:1650576

[thread-3-]total time:1128490

[thread-3-]total count/s:1462.6412

Sum of the four per-thread rates: 5516.4072 records/s

Loading 580 MB of data takes roughly 20 minutes on average. The resulting index files are about 1.76 GB.
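The summary figure for each run is the sum of the four per-thread rates. Since the threads finish at different times, the end-to-end rate is somewhat lower. The sketch below computes both from the logged counts and times of the last run, assuming all four threads started at roughly the same moment (they are started back to back in main). The class name ThroughputSummary is hypothetical.

```java
public class ThroughputSummary {

    // Sum of per-thread rates: each thread's count divided by its own elapsed time.
    static double sumOfRates(long[] counts, long[] millis) {
        double sum = 0;
        for (int i = 0; i < counts.length; i++) {
            sum += counts[i] * 1000.0 / millis[i];
        }
        return sum;
    }

    // End-to-end rate: all records divided by the slowest thread's elapsed time.
    static double wallClockRate(long[] counts, long[] millis) {
        long total = 0;
        long slowest = 0;
        for (int i = 0; i < counts.length; i++) {
            total += counts[i];
            slowest = Math.max(slowest, millis[i]);
        }
        return total * 1000.0 / slowest;
    }

    public static void main(String[] args) {
        // Threads 2, 1, 0, 3 from the run with refresh_interval set,
        // field type conversion, and an explicit document ID.
        long[] counts = {1199411L, 1447214L, 1508043L, 1650576L};
        long[] millis = {912251L, 1067117L, 1090577L, 1128490L};

        System.out.printf("sum of per-thread rates: %.1f records/s%n",
                sumOfRates(counts, millis));
        System.out.printf("end-to-end rate:         %.1f records/s%n",
                wallClockRate(counts, millis));
    }
}
```

For this run the four threads loaded 5,805,244 records in total, and the slowest thread took about 1128 seconds, so the end-to-end rate comes out a few hundred records per second below the summed figure.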
