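Reading Large Excel Files in Java, Part 2 (Spaces vs. Empty Values)
Recently, in a project, I found a problem when reading Excel 2007 files (which are XML underneath) with the approach from "Reading Large Excel Files in Java, Part 1": if a cell in the .xlsx file holds an empty value, it is silently skipped. (Note that "empty value" here means nothing was entered at all, not a cell where spaces were typed. A cell containing only spaces also looks blank, but it has a placeholder, and it is represented differently in the underlying XML.) For example, if a sheet has five columns and columns 2 and 3 have no value, those two cells are never read. If the program later inspects each row by position, values end up attributed to the wrong columns, which easily leads to errors. So I looked into the internals of .xlsx, as follows:
Create a new .xlsx file with three rows: the first row is five cells merged into one, the second row holds five headers that all have values, and the third row has a value only in the fifth column. Note that the other columns in row three contain no spaces; they are the "empty values" this article is about. Now look at the XML behind this Excel file (PS: just rename the .xlsx file to .zip and open it). Find the xl > worksheets folder inside and open sheet1.xml: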
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"
xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
mc:Ignorable="x14ac"
xmlns:x14ac="http://schemas.microsoft.com/office/spreadsheetml/2009/9/ac">
<dimension ref="A1:E3" />
<sheetViews>
<sheetView tabSelected="1" workbookViewId="0">
<selection activeCell="E3" sqref="E3" />
</sheetView>
</sheetViews>
<sheetFormatPr defaultRowHeight="13.5" x14ac:dyDescent="0.15" />
<sheetData>
<row r="1" spans="1:5" x14ac:dyDescent="0.15">
<c r="A1" s="1" t="s">
<v>5</v>
</c>
<c r="B1" s="1" />
<c r="C1" s="1" />
<c r="D1" s="1" />
<c r="E1" s="1" />
</row>
<row r="2" spans="1:5" x14ac:dyDescent="0.15">
<c r="A2" s="2" t="s">
<v>0</v>
</c>
<c r="B2" s="2" t="s">
<v>1</v>
</c>
<c r="C2" s="2" t="s">
<v>2</v>
</c>
<c r="D2" s="2" t="s">
<v>3</v>
</c>
<c r="E2" s="2" t="s">
<v>4</v>
</c>
</row>
<row r="3" spans="1:5" x14ac:dyDescent="0.15">
<c r="E3" s="2">
<v>10000</v>
</c>
</row>
</sheetData>
<mergeCells count="1">
<mergeCell ref="A1:E1" />
</mergeCells>
<phoneticPr fontId="1" type="noConversion" />
<pageMargins left="0.7" right="0.7" top="0.75" bottom="0.75" header="0.3" footer="0.3" />
<pageSetup paperSize="9" orientation="portrait" horizontalDpi="96" verticalDpi="96" r:id="rId1" />
</worksheet>
sheet1.xml
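The rename-to-.zip step can also be done programmatically, since an .xlsx file is an ordinary zip container. The following is only a minimal sketch, assuming Java 9 or later; the class name XlsxEntryDump and the method readEntry are made up for illustration and are not part of POI or of the original code.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper (not from the article): dumps one XML part of an .xlsx file.
class XlsxEntryDump {

    /** Returns the text of one entry inside a zip container, e.g. "xl/worksheets/sheet1.xml". */
    static String readEntry(Path zipPath, String entryName) throws IOException {
        try (ZipFile zip = new ZipFile(zipPath.toFile())) {
            ZipEntry entry = zip.getEntry(entryName);
            if (entry == null) {
                throw new IOException("No such entry: " + entryName);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                // Worksheet parts are declared as UTF-8 in the listing above.
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: XlsxEntryDump <file.xlsx>");
            return;
        }
        System.out.println(readEntry(Paths.get(args[0]), "xl/worksheets/sheet1.xml"));
    }
}
```

Run it with the path of the .xlsx file as the only argument. The entry name xl/worksheets/sheet1.xml matches the folder layout described above; other sheets may use different part names.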
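Next, the only change is to type a few spaces into one otherwise empty cell of the same Excel file (A3 in the listing below). Here is the resulting XML: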
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"
xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
mc:Ignorable="x14ac"
xmlns:x14ac="http://schemas.microsoft.com/office/spreadsheetml/2009/9/ac">
<dimension ref="A1:E3" />
<sheetViews>
<sheetView tabSelected="1" workbookViewId="0">
<selection activeCell="A3" sqref="A3" />
</sheetView>
</sheetViews>
<sheetFormatPr defaultRowHeight="13.5" x14ac:dyDescent="0.15" />
<sheetData>
<row r="1" spans="1:5" x14ac:dyDescent="0.15">
<c r="A1" s="2" t="s">
<v>5</v>
</c>
<c r="B1" s="2" />
<c r="C1" s="2" />
<c r="D1" s="2" />
<c r="E1" s="2" />
</row>
<row r="2" spans="1:5" x14ac:dyDescent="0.15">
<c r="A2" s="1" t="s">
<v>0</v>
</c>
<c r="B2" s="1" t="s">
<v>1</v>
</c>
<c r="C2" s="1" t="s">
<v>2</v>
</c>
<c r="D2" s="1" t="s">
<v>3</v>
</c>
<c r="E2" s="1" t="s">
<v>4</v>
</c>
</row>
<row r="3" spans="1:5" x14ac:dyDescent="0.15">
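<c r="A3" t="s"><!-- Note: a few spaces were typed into this cell, so the underlying XML stores it as a cell with a value (t="s"), unlike a cell where nothing was entered. -->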
<v>6</v>
</c>
<c r="E3" s="1">
<v>10000</v>
</c>
</row>
</sheetData>
<mergeCells count="1">
<mergeCell ref="A1:E1" />
</mergeCells>
<phoneticPr fontId="1" type="noConversion" />
<pageMargins left="0.7" right="0.7" top="0.75" bottom="0.75"
header="0.3" footer="0.3" />
<pageSetup paperSize="9" orientation="portrait"
horizontalDpi="96" verticalDpi="96" r:id="rId1" />
</worksheet>
sheet2.xml
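Comparing the two listings shows that the .xlsx format treats empty values and spaces differently, and the two must not be confused. So the question is: what should we do when the Excel 2007 file being parsed contains empty values? (Cells with spaces are not a concern. They are treated as cells that have a value; the value just reads back as blank, the column order is preserved, and there is essentially no impact.) One possible solution follows, for reference only.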
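Before the full listing, the core idea can be sketched in isolation: take the column letters from each cell's r attribute, convert them to a zero-based index, and insert "" for every index that was skipped. This is a simplified sketch, not the POI-based handler itself. The class GapFillSketch and the method readRow are hypothetical names. It returns the raw text of each v element, so for t="s" cells it yields the shared-string index rather than the string. It also pads the end of the row up to minColumns, which is an addition of this sketch.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: gap filling for one <row> fragment, without POI.
class GapFillSketch {

    /** Converts a column name such as "A" or "AB" to a zero-based index. */
    static int nameToColumn(String name) {
        int column = -1;
        for (int i = 0; i < name.length(); i++) {
            column = (column + 1) * 26 + (name.charAt(i) - 'A');
        }
        return column;
    }

    /** Parses one row fragment and returns its raw v texts, with "" for absent cells. */
    static List<String> readRow(String rowXml, int minColumns) throws Exception {
        List<String> cells = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();
            private int column = -1;
            private boolean inValue;

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if ("c".equals(qName)) {
                    // Strip the row digits from a reference like "E3" to get "E".
                    column = nameToColumn(atts.getValue("r").replaceAll("\\d", ""));
                } else if ("v".equals(qName)) {
                    inValue = true;
                    text.setLength(0);
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inValue) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if ("v".equals(qName)) {
                    inValue = false;
                    // Fill every skipped column before storing this value.
                    while (cells.size() < column) {
                        cells.add("");
                    }
                    cells.add(text.toString());
                } else if ("row".equals(qName)) {
                    // Pad trailing absent cells up to the expected width.
                    while (cells.size() < minColumns) {
                        cells.add("");
                    }
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(rowXml)), handler);
        return cells;
    }

    public static void main(String[] args) throws Exception {
        // Row 3 of the first listing: only E3 is present in the XML.
        String row3 = "<row r=\"3\"><c r=\"E3\" s=\"2\"><v>10000</v></c></row>";
        System.out.println(readRow(row3, 5));
    }
}
```

For row 3 of the first listing this returns a five-element list whose first four entries are empty strings, so the value 10000 stays in the fifth position.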
/* ====================================================================
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
==================================================================== */
package com.speed.excel;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.PrintStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.apache.poi.openxml4j.exceptions.OpenXML4JException;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.openxml4j.opc.PackageAccess;
import org.apache.poi.ss.usermodel.BuiltinFormats;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.xssf.eventusermodel.ReadOnlySharedStringsTable;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.model.StylesTable;
import org.apache.poi.xssf.usermodel.XSSFCellStyle;
import org.apache.poi.xssf.usermodel.XSSFRichTextString;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
/**
* A rudimentary XLSX -> CSV processor modeled on the
* POI sample program XLS2CSVmra by Nick Burch from the
* package org.apache.poi.hssf.eventusermodel.examples.
* Unlike the HSSF version, this one completely ignores
* missing rows.
* <p/>
* Data sheets are read using a SAX parser to keep the
* memory footprint relatively small, so this should be
* able to read enormous workbooks. The styles table and
* the shared-string table must be kept in memory. The
* standard POI styles table class is used, but a custom
* (read-only) class is used for the shared string table
* because the standard POI SharedStringsTable grows very
* quickly with the number of unique strings.
* <p/>
* Thanks to Eric Smith for a patch that fixes a problem
* triggered by cells with multiple "t" elements, which is
* how Excel represents different formats (e.g., one word
* plain and one word bold).
*
* @author Chris Lott
*/
public class XLSX2CSV {
/**
* The type of the data value is indicated by an attribute on the cell.
* The value is usually in a "v" element within the cell.
*/
enum xssfDataType {
BOOL,
ERROR,
FORMULA,
INLINESTR,
SSTINDEX,
NUMBER,
}
/**
* Derived from http://poi.apache.org/spreadsheet/how-to.html#xssf_sax_api
* <p/>
* Also see Standard ECMA-376, 1st edition, part 4, pages 1928ff, at
* http://www.ecma-international.org/publications/standards/Ecma-376.htm
* <p/>
* A web-friendly version is http://openiso.org/Ecma/376/Part4
*/
class MyXSSFSheetHandler extends DefaultHandler {
/**
* Table with styles
*/
private StylesTable stylesTable;
/**
* Table with unique strings
*/
private ReadOnlySharedStringsTable sharedStringsTable;
/**
* Destination for data
*/
private final PrintStream output;
/**
* Number of columns to read starting with leftmost
*/
private final int minColumnCount;
// Set when V start element is seen
private boolean vIsOpen;
// Set when cell start element is seen;
// used when cell close element is seen.
private xssfDataType nextDataType;
// Used to format numeric cell values.
private short formatIndex;
private String formatString;
private final DataFormatter formatter;
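// Zero-based index of the column currently being read
private int thisColumn = -1;
// Index of the last column for which a value was read
private int lastColumnNumber = -1;
// Gathers characters as they are seen.
private StringBuffer value;
// Holds the cell values of the current row
private List<String> rowlist = new ArrayList<String>();
// True when absent cells (nothing entered, not even a space) lie between
// the previously read column and the current one; defaults to false
private boolean flag = false ;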
/**
* Accepts objects needed while parsing.
*
* @param styles Table of styles
* @param strings Table of shared strings
* @param cols Minimum number of columns to show
* @param target Sink for output
*/
public MyXSSFSheetHandler(
StylesTable styles,
ReadOnlySharedStringsTable strings,
int cols,
PrintStream target) {
this.stylesTable = styles;
this.sharedStringsTable = strings;
this.minColumnCount = cols;
this.output = target;
this.value = new StringBuffer();
this.nextDataType = xssfDataType.NUMBER;
this.formatter = new DataFormatter();
}
/*
* (non-Javadoc)
* @see org.xml.sax.helpers.DefaultHandler#startElement(java.lang.String, java.lang.String, java.lang.String, org.xml.sax.Attributes)
*/
public void startElement(String uri, String localName, String name,
Attributes attributes) throws SAXException {
if ("inlineStr".equals(name) || "v".equals(name)) {
vIsOpen = true;
// Clear contents cache
value.setLength(0);
}
// c => cell
else if ("c".equals(name)) {
// Get the cell reference
String r = attributes.getValue("r");
int firstDigit = -1;
for (int c = 0; c < r.length(); ++c) {
if (Character.isDigit(r.charAt(c))) {
firstDigit = c;
break;
}
}
thisColumn = nameToColumn(r.substring(0, firstDigit)); // zero-based index of the current column
// Set up defaults.
this.nextDataType = xssfDataType.NUMBER;
this.formatIndex = -1;
this.formatString = null;
String cellType = attributes.getValue("t");
String cellStyleStr = attributes.getValue("s");
if ("b".equals(cellType)){
nextDataType = xssfDataType.BOOL;
}else if ("e".equals(cellType)){
nextDataType = xssfDataType.ERROR;
}else if ("inlineStr".equals(cellType)){
nextDataType = xssfDataType.INLINESTR;
}else if ("s".equals(cellType)){
nextDataType = xssfDataType.SSTINDEX;
}else if ("str".equals(cellType)){
nextDataType = xssfDataType.FORMULA;
}else if (cellStyleStr != null) {
// It's a number, but almost certainly one
// with a special style or format
int styleIndex = Integer.parseInt(cellStyleStr);
XSSFCellStyle style = stylesTable.getStyleAt(styleIndex);
this.formatIndex = style.getDataFormat();
this.formatString = style.getDataFormatString();
if (this.formatString == null){
this.formatString = BuiltinFormats.getBuiltinFormat(this.formatIndex);
}
}
}
}
/*
* (non-Javadoc)
* @see org.xml.sax.helpers.DefaultHandler#endElement(java.lang.String, java.lang.String, java.lang.String)
*/
public void endElement(String uri, String localName, String name)
throws SAXException {
String thisStr = null;
// v => contents of a cell
if ("v".equals(name)) {
// Process the value contents as required.
// Do now, as characters() may be called more than once
switch (nextDataType) {
case BOOL:
char first = value.charAt(0);
thisStr = first == '0' ? "FALSE" : "TRUE";
break;
case ERROR:
thisStr = "\"ERROR:" + value.toString() + '"';
break;
case FORMULA:
// A formula could result in a string value,
// so always add double-quote characters.
thisStr = '"' + value.toString() + '"';
break;
case INLINESTR:
// TODO: have not seen an example of this, so it's untested.
XSSFRichTextString rtsi = new XSSFRichTextString(value.toString());
thisStr = '"' + rtsi.toString() + '"';
break;
case SSTINDEX:
String sstIndex = value.toString();
try {
int idx = Integer.parseInt(sstIndex);
XSSFRichTextString rtss = new XSSFRichTextString(sharedStringsTable.getEntryAt(idx));
// thisStr = '"' + rtss.toString() + '"';
thisStr = rtss.toString();
}
catch (NumberFormatException ex) {
output.println("Failed to parse SST index '" + sstIndex + "': " + ex.toString());
}
break;
case NUMBER:
String n = value.toString();
if (this.formatString != null){
thisStr = formatter.formatRawCellContents(Double.parseDouble(n), this.formatIndex, this.formatString);
}else{
thisStr = n;
}
break;
default:
thisStr = "(TODO: Unexpected type: " + nextDataType + ")";
break;
}
// Output after we've seen the string contents
// Emit commas for any fields that were missing on this row
/*if (lastColumnNumber == -1) {
lastColumnNumber = 0;
}*/
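// Core of the fix: within one row, if the current column index exceeds the
// previously read one by more than 1, the cells in between were never read.
// Because .xlsx is XML underneath, these are the absent ("empty value") cells.
if(thisColumn - lastColumnNumber > 1){
flag = true ;
}
for (int i = lastColumnNumber; i < thisColumn; ++i){
if(flag && i > lastColumnNumber){
rowlist.add(i, "");
}
}
// Might be the empty string; thisStr stays null if the SST index failed to parse.
rowlist.add(thisColumn, thisStr == null ? "" : thisStr.trim());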
// Update column
if (thisColumn > -1){
lastColumnNumber = thisColumn;
}
} else if ("row".equals(name)) { // reached the end of a row
// Print out any missing commas if needed
if (minColumns > 0) {
// Columns are 0 based
if (lastColumnNumber == -1) {
lastColumnNumber = 0;
}
for (int i = lastColumnNumber; i < (this.minColumnCount); i++) {
output.print("");
}
}
rowReader.getRows(sheetIndex, curRow, rowlist);
rowlist.clear();
curRow++;
flag = false ;
// We're onto a new row
output.println();
lastColumnNumber = -1;
}
}
/**
* Captures characters only if a suitable element is open.
* Originally was just "v"; extended for inlineStr also.
*/
public void characters(char[] ch, int start, int length)
throws SAXException {
if (vIsOpen){
value.append(ch, start, length);
}
}
/**
* Converts an Excel column name like "C" to a zero-based index.
*
* @param name
* @return Index corresponding to the specified name
*/
private int nameToColumn(String name) {
int column = -1;
for (int i = 0; i < name.length(); ++i) {
int c = name.charAt(i);
column = (column + 1) * 26 + c - 'A';
}
return column;
}
public List<String> getRowlist() {
return rowlist;
}
public void setRowlist(List<String> rowlist) {
this.rowlist = rowlist;
}
}
///////////////////////////////////////
private OPCPackage xlsxPackage;
private int minColumns;
private PrintStream output;
// Current row
private int curRow = 0;
private int sheetIndex = 0;
private IRowReader rowReader;
public void setRowReader(IRowReader rowReader) {
this.rowReader = rowReader;
}
/**
* Creates a new XLSX -> CSV converter
*
* @param pkg The XLSX package to process
* @param output The PrintStream to output the CSV to
* @param minColumns The minimum number of columns to output, or -1 for no minimum
* @param rowReader
*/
public XLSX2CSV(OPCPackage pkg, PrintStream output, int minColumns,IRowReader rowReader) {
this.xlsxPackage = pkg;
this.output = output;
this.minColumns = minColumns;
this.rowReader = rowReader;
}
/**
* Parses and shows the content of one sheet
* using the specified styles and shared-strings tables.
*
* @param styles
* @param strings
* @param sheetInputStream
*/
public void processSheet(
StylesTable styles,
ReadOnlySharedStringsTable strings,
InputStream sheetInputStream)
throws IOException, ParserConfigurationException, SAXException {
InputSource sheetSource = new InputSource(sheetInputStream);
SAXParserFactory saxFactory = SAXParserFactory.newInstance();
SAXParser saxParser = saxFactory.newSAXParser();
XMLReader sheetParser = saxParser.getXMLReader();
ContentHandler handler = new MyXSSFSheetHandler(styles, strings, this.minColumns, this.output);
sheetParser.setContentHandler(handler);
sheetParser.parse(sheetSource);
}
/**
* Initiates the processing of the XLS workbook file to CSV.
*
* @throws IOException
* @throws OpenXML4JException
* @throws ParserConfigurationException
* @throws SAXException
*/
public void process() throws Exception {
InputStream stream = null ;
try {
ReadOnlySharedStringsTable strings = new ReadOnlySharedStringsTable(this.xlsxPackage);
XSSFReader xssfReader = new XSSFReader(this.xlsxPackage);
StylesTable styles = xssfReader.getStylesTable();
XSSFReader.SheetIterator iter = (XSSFReader.SheetIterator) xssfReader.getSheetsData();
while (iter.hasNext()) {
curRow = 0;
sheetIndex++;
stream = iter.next();
processSheet(styles, strings, stream);
stream.close();
}
} catch (IllegalArgumentException e) {
e.printStackTrace();
throw new IllegalArgumentException(e.getMessage());
} catch (Exception e) {
e.printStackTrace();
throw new Exception("Failed to read the file; it may be corrupted. Please upload it again using the template.", e);
}finally{
if(null != stream){
stream.close();
}
}
}
public static void main(String[] args) throws Exception {
File xlsxFile = new File("C:\\Users\\liuyue\\Desktop\\test1.xlsx");
if (!xlsxFile.exists()) {
System.err.println("Not found or not a file: " + xlsxFile.getPath());
return;
}
int minColumns = 6 ;
// The package open is instantaneous, as it should be.
OPCPackage p = OPCPackage.open(xlsxFile.getPath(), PackageAccess.READ);
IRowReader reader = new RowReaderTest();
XLSX2CSV xlsx2csv = new XLSX2CSV(p, System.out, minColumns ,reader);
xlsx2csv.process();
}
}
XLSX2CSV
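The listing calls rowReader.getRows(sheetIndex, curRow, rowlist), but IRowReader and RowReaderTest are not shown in this post. The sketch below is a guess at a compatible shape, inferred only from that call; the interface signature and the class PaddingRowReader are assumptions, not the original code. It also addresses two details of the listing: the handler clears and reuses rowlist after each call, and the row-end branch only prints empty strings, so cells missing at the end of a row are not added to rowlist.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Assumed shape of the callback; only the getRows call is visible in the listing.
interface IRowReader {
    void getRows(int sheetIndex, int curRow, List<String> rowlist);
}

// Hypothetical implementation: keeps each row, padded to a fixed width.
class PaddingRowReader implements IRowReader {
    private final int width;
    private final List<List<String>> rows = new ArrayList<>();

    PaddingRowReader(int width) {
        this.width = width;
    }

    @Override
    public void getRows(int sheetIndex, int curRow, List<String> rowlist) {
        // Copy first: the handler clears and reuses the same list after this call returns.
        List<String> copy = new ArrayList<>(rowlist);
        // Pad trailing absent cells so every row has the same number of columns.
        while (copy.size() < width) {
            copy.add("");
        }
        rows.add(copy);
    }

    List<List<String>> rows() {
        return rows;
    }
}

class PaddingRowReaderDemo {
    public static void main(String[] args) {
        PaddingRowReader reader = new PaddingRowReader(5);
        // Row 1 of the sample sheet yields a single value, because B1..E1 have no v element.
        reader.getRows(1, 0, new ArrayList<>(Arrays.asList("5")));
        System.out.println(reader.rows().get(0).size());
    }
}
```

Padding in the callback keeps the handler untouched; the same loop could instead go into the row-end branch of endElement if modifying the handler is preferred.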