最近在用poi读取大容量excel,发现只要是excel文件大于2M左右,便会出现OOM(out of memory),经过查询得知,原来poi读取excel的原理是如下:

org.apache.poi.xssf.usermodel.XSSFWorkbook.XSSFWorkbook(InputStream is) throws IOException
采用usermodel,这种方式是以dom方式读取excel,好处是读取方便,不足是一次性将文件加载到内存中,容易造成OOM;第二种模型,eventusermodel,这种方式采用事件驱动的方法解析xml,在遇到文件内容时,事件会触发,这种做法可以大大降低内存的消耗。以下是我整理的读取大容量excel2003(.xls)和excel2007(.xlsx)的例子,仅供学习
其他资料参考:
 

package com.speed.excel;

import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.poi.hssf.eventusermodel.EventWorkbookBuilder.SheetRecordCollectingListener;
import org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener;
import org.apache.poi.hssf.eventusermodel.HSSFEventFactory;
import org.apache.poi.hssf.eventusermodel.HSSFListener;
import org.apache.poi.hssf.eventusermodel.HSSFRequest;
import org.apache.poi.hssf.eventusermodel.MissingRecordAwareHSSFListener;
import org.apache.poi.hssf.eventusermodel.dummyrecord.LastCellOfRowDummyRecord;
import org.apache.poi.hssf.eventusermodel.dummyrecord.MissingCellDummyRecord;
import org.apache.poi.hssf.model.HSSFFormulaParser;
import org.apache.poi.hssf.record.BOFRecord;
import org.apache.poi.hssf.record.BlankRecord;
import org.apache.poi.hssf.record.BoolErrRecord;
import org.apache.poi.hssf.record.BoundSheetRecord;
import org.apache.poi.hssf.record.FormulaRecord;
import org.apache.poi.hssf.record.LabelRecord;
import org.apache.poi.hssf.record.LabelSSTRecord;
import org.apache.poi.hssf.record.NumberRecord;
import org.apache.poi.hssf.record.Record;
import org.apache.poi.hssf.record.SSTRecord;
import org.apache.poi.hssf.record.StringRecord;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

/**
* 抽象Excel2003读取器,通过实现HSSFListener监听器,采用事件驱动模式解析excel2003
* 中的内容,遇到特定事件才会触发,大大减少了内存的使用。
*
*/
public class Excel2003Reader implements HSSFListener{
private int minColumns = -1;
private POIFSFileSystem fs;
private int lastRowNumber;
private int lastColumnNumber;

/** Should we output the formula, or the value it has? */
private boolean outputFormulaValues = true;

/** For parsing Formulas */
private SheetRecordCollectingListener workbookBuildingListener;
//excel2003工作薄
private HSSFWorkbook stubWorkbook;

// Records we pick up as we process
private SSTRecord sstRecord;
private FormatTrackingHSSFListener formatListener;

//表索引
private int sheetIndex = -1;
private BoundSheetRecord[] orderedBSRs;
@SuppressWarnings("unchecked")
private ArrayList boundSheetRecords = new ArrayList();

// For handling formulas with string results
private int nextRow;
private int nextColumn;
private boolean outputNextStringRecord;
//当前行
private int curRow = 0;
//存储行记录的容器
private List<String> rowlist = new ArrayList<String>();;
@SuppressWarnings( "unused")
private String sheetName;

private IRowReader rowReader;

public void setRowReader(IRowReader rowReader){
this.rowReader = rowReader;
}

/**
* 遍历excel下所有的sheet
* @throws IOException
*/
public void process(String fileName) throws IOException {
this.fs = new POIFSFileSystem(new FileInputStream(fileName));
MissingRecordAwareHSSFListener listener = new MissingRecordAwareHSSFListener(
this);
formatListener = new FormatTrackingHSSFListener(listener);
HSSFEventFactory factory = new HSSFEventFactory();
HSSFRequest request = new HSSFRequest();
if (outputFormulaValues) {
request.addListenerForAllRecords(formatListener);
} else {
workbookBuildingListener = new SheetRecordCollectingListener(
formatListener);
request.addListenerForAllRecords(workbookBuildingListener);
}
factory.processWorkbookEvents(request, fs);
}

/**
* HSSFListener 监听方法,处理 Record
*/
@SuppressWarnings("unchecked")
public void processRecord(Record record) {
int thisRow = -1;
int thisColumn = -1;
String thisStr = null;
String value = null;
switch (record.getSid()) {
case BoundSheetRecord.sid:
boundSheetRecords.add(record);
break;
case BOFRecord.sid:
BOFRecord br = (BOFRecord) record;
if (br.getType() == BOFRecord.TYPE_WORKSHEET) {
// 如果有需要,则建立子工作薄
if (workbookBuildingListener != null && stubWorkbook == null) {
stubWorkbook = workbookBuildingListener
.getStubHSSFWorkbook();
}

sheetIndex++;
if (orderedBSRs == null) {
orderedBSRs = BoundSheetRecord
.orderByBofPosition(boundSheetRecords);
}
sheetName = orderedBSRs[sheetIndex].getSheetname();
}
break;

case SSTRecord.sid:
sstRecord = (SSTRecord) record;
break;

case BlankRecord.sid:
BlankRecord brec = (BlankRecord) record;
thisRow = brec.getRow();
thisColumn = brec.getColumn();
thisStr = "";
rowlist.add(thisColumn, thisStr);
break;
case BoolErrRecord.sid: //单元格为布尔类型
BoolErrRecord berec = (BoolErrRecord) record;
thisRow = berec.getRow();
thisColumn = berec.getColumn();
thisStr = berec.getBooleanValue()+"";
rowlist.add(thisColumn, thisStr);
break;

case FormulaRecord.sid: //单元格为公式类型
FormulaRecord frec = (FormulaRecord) record;
thisRow = frec.getRow();
thisColumn = frec.getColumn();
if (outputFormulaValues) {
if (Double.isNaN(frec.getValue())) {
// Formula result is a string
// This is stored in the next record
outputNextStringRecord = true;
nextRow = frec.getRow();
nextColumn = frec.getColumn();
} else {
thisStr = formatListener.formatNumberDateCell(frec);
}
} else {
thisStr = '"' + HSSFFormulaParser.toFormulaString(stubWorkbook,
frec.getParsedExpression()) + '"';
}
rowlist.add(thisColumn,thisStr);
break;
case StringRecord.sid://单元格中公式的字符串
if (outputNextStringRecord) {
// String for formula
StringRecord srec = (StringRecord) record;
thisStr = srec.getString();
thisRow = nextRow;
thisColumn = nextColumn;
outputNextStringRecord = false;
}
break;
case LabelRecord.sid:
LabelRecord lrec = (LabelRecord) record;
curRow = thisRow = lrec.getRow();
thisColumn = lrec.getColumn();
value = lrec.getValue().trim();
value = value.equals("")?" ":value;
this.rowlist.add(thisColumn, value);
break;
case LabelSSTRecord.sid: //单元格为字符串类型
LabelSSTRecord lsrec = (LabelSSTRecord) record;
curRow = thisRow = lsrec.getRow();
thisColumn = lsrec.getColumn();
if (sstRecord == null) {
rowlist.add(thisColumn, " ");
} else {
value = sstRecord
.getString(lsrec.getSSTIndex()).toString().trim();
value = value.equals("")?" ":value;
rowlist.add(thisColumn,value);
}
break;
case NumberRecord.sid: //单元格为数字类型
NumberRecord numrec = (NumberRecord) record;
curRow = thisRow = numrec.getRow();
thisColumn = numrec.getColumn();
value = formatListener.formatNumberDateCell(numrec).trim();
value = value.equals("")?" ":value;
// 向容器加入列值
rowlist.add(thisColumn, value);
break;
default:
break;
}

// 遇到新行的操作
if (thisRow != -1 && thisRow != lastRowNumber) {
lastColumnNumber = -1;
}

// 空值的操作
if (record instanceof MissingCellDummyRecord) {
MissingCellDummyRecord mc = (MissingCellDummyRecord) record;
curRow = thisRow = mc.getRow();
thisColumn = mc.getColumn();
rowlist.add(thisColumn," ");
}

// 更新行和列的值
if (thisRow > -1)
lastRowNumber = thisRow;
if (thisColumn > -1)
lastColumnNumber = thisColumn;

// 行结束时的操作
if (record instanceof LastCellOfRowDummyRecord) {
if (minColumns > 0) {
// 列值重新置空
if (lastColumnNumber == -1) {
lastColumnNumber = 0;
}
}
lastColumnNumber = -1;
// 每行结束时, 调用getRows() 方法
rowReader.getRows(sheetIndex,curRow, rowlist);

// 清空容器
rowlist.clear();
}
}

}

package com.speed.excel;

import java.io.InputStream;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.model.SharedStringsTable;
import org.apache.poi.xssf.usermodel.XSSFRichTextString;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

/**
* 抽象Excel2007读取器,excel2007的底层数据结构是xml文件,采用SAX的事件驱动的方法解析
* xml,需要继承DefaultHandler,在遇到文件内容时,事件会触发,这种做法可以大大降低
* 内存的耗费,特别使用于大数据量的文件。
*
*/
public class Excel2007Reader extends DefaultHandler {
//共享字符串表
private SharedStringsTable sst;
//上一次的内容
private String lastContents;
private boolean nextIsString;

private int sheetIndex = -1;
private List<String> rowlist = new ArrayList<String>();
//当前行
private int curRow = 0;
//当前列
private int curCol = 0;

private IRowReader rowReader;

public void setRowReader(IRowReader rowReader){
this.rowReader = rowReader;
}

/**只遍历一个电子表格,其中sheetId为要遍历的sheet索引,从1开始,1-3
* @param filename
* @param sheetId
* @throws Exception
*/
public void processOneSheet(String filename,int sheetId) throws Exception {
OPCPackage pkg = OPCPackage.open(filename);
XSSFReader r = new XSSFReader(pkg);
SharedStringsTable sst = r.getSharedStringsTable();
XMLReader parser = fetchSheetParser(sst);

// 根据 rId# 或 rSheet# 查找sheet
InputStream sheet2 = r.getSheet("rId"+sheetId);
sheetIndex++;
InputSource sheetSource = new InputSource(sheet2);
parser.parse(sheetSource);
sheet2.close();
}

/**
* 遍历工作簿中所有的电子表格
* @param filename
* @throws Exception
*/
public void process(String filename) throws Exception {
try {
OPCPackage pkg = OPCPackage.open(filename);
XSSFReader r = new XSSFReader(pkg);
SharedStringsTable sst = r.getSharedStringsTable();
XMLReader parser = fetchSheetParser(sst);
Iterator<InputStream> sheets = r.getSheetsData();
while (sheets.hasNext()) {
curRow = 0;
sheetIndex++;
InputStream sheet = sheets.next();
InputSource sheetSource = new InputSource(sheet);
parser.parse(sheetSource);
sheet.close();
}
} catch (Exception e) {
throw new IllegalArgumentException("读取文件失败,此文件可能已损坏,请参照模板重新上传!");
}
}

public XMLReader fetchSheetParser(SharedStringsTable sst)
throws SAXException {
XMLReader parser = XMLReaderFactory
.createXMLReader("org.apache.xerces.parsers.SAXParser");
this.sst = sst;
parser.setContentHandler(this);
return parser;
}

public void startElement(String uri, String localName, String name,
Attributes attributes) throws SAXException {

// c => 单元格
if ("c".equals(name)) {
// 如果下一个元素是 SST 的索引,则将nextIsString标记为true
String cellType = attributes.getValue("t");
if ("s".equals(cellType)) {
nextIsString = true;
} else {
nextIsString = false;
}
}
// 置空
lastContents = "";
}

public void endElement(String uri, String localName, String name)
throws SAXException {

// 根据SST的索引值的到单元格的真正要存储的字符串
// 这时characters()方法可能会被调用多次
if (nextIsString) {
try {
int idx = Integer.parseInt(lastContents);
lastContents = new XSSFRichTextString(sst.getEntryAt(idx))
.toString();
} catch (Exception e) {

}
}
if ("v".equals(name)) {
String value = lastContents.trim();
value = value.equals("")?" ":value;
rowlist.add(curCol, value);
curCol++;
}else {
//如果标签名称为 row ,这说明已到行尾,调用 optRows() 方法
if (name.equals("row")) {
rowReader.getRows(sheetIndex,curRow,rowlist);
rowlist.clear();
curRow++;
curCol = 0;
}
}
}

public void characters(char[] ch, int start, int length)
throws SAXException {
//得到单元格内容的值
lastContents += new String(ch, start, length);
}
}

java读取大容量excel之一的更多相关文章

  1. java读取大容量excel之二(空格、空值问题)

    最近在项目中发现,对于Excel2007(底层根本是xml) ,使用<java读取大容量excel之一>中的方式读取,若待读取的excel2007文件中某一列是空值,(注意,所谓的空值是什 ...

  2. java - 读取,导出 excel文件数据

    首先需下载poi java包,添加至构建路径, 写处理方法: import java.io.FileInputStream;import java.io.FileOutputStream;import ...

  3. Java读取批量Excel文件

    1.首先基础知识: 原文链接:https://blog.csdn.net/baidu_39298625/article/details/105842725 一 :简介 开发中经常会设计到excel的处 ...

  4. java读取大文件 超大文件的几种方法

    java 读取一个巨大的文本文件既能保证内存不溢出又能保证性能       import java.io.BufferedReader; import java.io.File; import jav ...

  5. Java读取Excel文件的几种方法

    Java读取 Excel 文件的常用开源免费方法有以下几种: 1. JDBC-ODBC Excel Driver 2. jxl.jar 3. jcom.jar 4. poi.jar 简单介绍: 百度文 ...

  6. JAVA读取EXCEL文件异常Unable to recognize OLE stream

    异常: jxl.read.biff.BiffException: Unable to recognize OLE stream at jxl.read.biff.CompoundFile.<in ...

  7. Java读取excel表格

    Java读取excel表格 一般都是用poi技术去读取excel表格的,但是这个技术又是什么呢 什么是Apache POI? Apache POI是一种流行的API,它允许程序员使用Java程序创建, ...

  8. java读取excel文件的代码

    如下内容段是关于java读取excel文件的内容,应该能对各朋友有所用途. package com.zsmj.utilit; import java.io.FileInputStream;import ...

  9. java操作office和pdf文件java读取word,excel和pdf文档内容

    在平常应用程序中,对office和pdf文档进行读取数据是比较常见的功能,尤其在很多web应用程序中.所以今天我们就简单来看一下Java对word.excel.pdf文件的读取.本篇博客只是讲解简单应 ...

随机推荐

  1. 【eclipse】聚合工程maven启动Tomcat报错

    严重: Error configuring application listener of class严重: Skipped installing application listeners due ...

  2. 联想笔记本thinkpad按F2不能直接重命名

    联想笔记本thinkpad按F2不能直接重命名,而是Fn+F2. 解决: 按一下Fn+Esc(一般在左上角)

  3. undefined!=false之解 及==比较的规则

    JS中有一个基本概念就是: JavaScript中undefined==null 但undefined!==null undefined与null转换成布尔值都是false 如果按照常规想法,比如下面 ...

  4. 机器学习-数据可视化神器matplotlib学习之路(四)

    今天画一下3D图像,首先的另外引用一个包 from mpl_toolkits.mplot3d import Axes3D,接下来画一个球体,首先来看看球体的参数方程吧 (0≤θ≤2π,0≤φ≤π) 然 ...

  5. C#在服务端验证客户端证书(Certificate)

    使用https协议进行通讯的时候可以设置双向证书认证,客户端验证服务端证书的方法前面已经介绍过了,现在说一下在服务端验证客户端证书的方案. 这里给出的方案比较简单,只需要在Service端的配置文件中 ...

  6. Intel微处理器学习笔记(三) 不可见寄存器

    参考资料: 1.  http://blog.chinaunix.net/uid-20797642-id-2495244.html 2.  http://www.techbulo.com/708.htm ...

  7. ubuntu 16.04 u盘挂载以及卸载

    1.列出所有磁盘 sudo fdisk -l 2.最后一段信息显示的为u盘 Device Boot Start End Sectors Size Id Type /dev/sdb4 * 256 786 ...

  8. jsjl_for_ubuntu12.04

    1. VC++代码: #include <stdio.h> #include <windows.h> #include <wchar.h> void MoveMou ...

  9. hihoCoder 1636 Pangu and Stones

    hihoCoder 1636 Pangu and Stones 思路:区间dp. 状态:dp[i][j][k]表示i到j区间合并成k堆石子所需的最小花费. 初始状态:dp[i][j][j-i+1]=0 ...

  10. Codeforces 847B - Preparing for Merge Sort

    847B - Preparing for Merge Sort 思路:前面的排序的最后一个一定大于后面的排序的最后一个.所以判断要不要开始新的排序只要拿当前值和上一个排序最后一个比较就可以了. 代码: ...