Spark 2.1.1

Running the following SQL in Spark fails:

insert overwrite table test_parquet_table select * from dummy

The error is as follows:

org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.writeToFile(hiveWriterContainers.scala:333)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:210)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:210)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: Parquet record is malformed: empty fields are illegal, the field should be ommited completely instead
at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.write(DataWritableWriter.java:64)
at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:59)
at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:31)
at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:121)
at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:123)
at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:42)
at org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:111)
at org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:124)
at org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.writeToFile(hiveWriterContainers.scala:321)
... 8 more
Caused by: parquet.io.ParquetEncodingException: empty fields are illegal, the field should be ommited completely instead
at parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.endField(MessageColumnIO.java:244)
at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeMap(DataWritableWriter.java:241)
at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeValue(DataWritableWriter.java:116)
at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeGroupFields(DataWritableWriter.java:89)
at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.write(DataWritableWriter.java:60)
... 16 more

Following the stack trace into the code:

org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter

    private void writeMap(Object value, MapObjectInspector inspector, GroupType type) {
        GroupType repeatedType = type.getType(0).asGroupType();
        this.recordConsumer.startGroup();
        this.recordConsumer.startField(repeatedType.getName(), 0);
        Map<?, ?> mapValues = inspector.getMap(value);
        Type keyType = repeatedType.getType(0);
        String keyName = keyType.getName();
        ObjectInspector keyInspector = inspector.getMapKeyObjectInspector();
        Type valuetype = repeatedType.getType(1);
        String valueName = valuetype.getName();
        ObjectInspector valueInspector = inspector.getMapValueObjectInspector();

        for(Iterator i$ = mapValues.entrySet().iterator(); i$.hasNext(); this.recordConsumer.endGroup()) {
            Entry<?, ?> keyValue = (Entry)i$.next();
            this.recordConsumer.startGroup();
            if (keyValue != null) {
                Object keyElement = keyValue.getKey();
                this.recordConsumer.startField(keyName, 0);
                this.writeValue(keyElement, keyInspector, keyType);
                this.recordConsumer.endField(keyName, 0);
                Object valueElement = keyValue.getValue();
                if (valueElement != null) {
                    this.recordConsumer.startField(valueName, 1);
                    this.writeValue(valueElement, valueInspector, valuetype);
                    this.recordConsumer.endField(valueName, 1);
                }
            }
        }

        this.recordConsumer.endField(repeatedType.getName(), 0);
        this.recordConsumer.endGroup();
    }

    private void writeValue(Object value, ObjectInspector inspector, Type type) {
        if (type.isPrimitive()) {
            this.checkInspectorCategory(inspector, Category.PRIMITIVE);
            this.writePrimitive(value, (PrimitiveObjectInspector)inspector);
        } else {
            GroupType groupType = type.asGroupType();
            OriginalType originalType = type.getOriginalType();
            if (originalType != null && originalType.equals(OriginalType.LIST)) {
                this.checkInspectorCategory(inspector, Category.LIST);
                this.writeArray(value, (ListObjectInspector)inspector, groupType);
            } else if (originalType != null && originalType.equals(OriginalType.MAP)) {
                this.checkInspectorCategory(inspector, Category.MAP);
                this.writeMap(value, (MapObjectInspector)inspector, groupType);
            } else {
                this.checkInspectorCategory(inspector, Category.STRUCT);
                this.writeGroup(value, (StructObjectInspector)inspector, groupType);
            }
        }
    }

    private void writePrimitive(Object value, PrimitiveObjectInspector inspector) {
        if (value != null) {
            switch(inspector.getPrimitiveCategory()) {
            case VOID:
                return;
            case DOUBLE:
                this.recordConsumer.addDouble(((DoubleObjectInspector)inspector).get(value));
                break;
            case BOOLEAN:
                this.recordConsumer.addBoolean(((BooleanObjectInspector)inspector).get(value));
                break;
            case FLOAT:
                this.recordConsumer.addFloat(((FloatObjectInspector)inspector).get(value));
                break;
            case BYTE:
                this.recordConsumer.addInteger(((ByteObjectInspector)inspector).get(value));
                break;
            case INT:
                this.recordConsumer.addInteger(((IntObjectInspector)inspector).get(value));
                break;
            case LONG:
                this.recordConsumer.addLong(((LongObjectInspector)inspector).get(value));
                break;
            case SHORT:
                this.recordConsumer.addInteger(((ShortObjectInspector)inspector).get(value));
                break;
            case STRING:
                String v = ((StringObjectInspector)inspector).getPrimitiveJavaObject(value);
                this.recordConsumer.addBinary(Binary.fromString(v));
                break;
            case CHAR:
                String vChar = ((HiveCharObjectInspector)inspector).getPrimitiveJavaObject(value).getStrippedValue();
                this.recordConsumer.addBinary(Binary.fromString(vChar));
                break;
            case VARCHAR:
                String vVarchar = ((HiveVarcharObjectInspector)inspector).getPrimitiveJavaObject(value).getValue();
                this.recordConsumer.addBinary(Binary.fromString(vVarchar));
                break;
            case BINARY:
                byte[] vBinary = ((BinaryObjectInspector)inspector).getPrimitiveJavaObject(value);
                this.recordConsumer.addBinary(Binary.fromByteArray(vBinary));
                break;
            case TIMESTAMP:
                Timestamp ts = ((TimestampObjectInspector)inspector).getPrimitiveJavaObject(value);
                this.recordConsumer.addBinary(NanoTimeUtils.getNanoTime(ts, false).toBinary());
                break;
            case DECIMAL:
                HiveDecimal vDecimal = (HiveDecimal)inspector.getPrimitiveJavaObject(value);
                DecimalTypeInfo decTypeInfo = (DecimalTypeInfo)inspector.getTypeInfo();
                this.recordConsumer.addBinary(this.decimalToBinary(vDecimal, decTypeInfo));
                break;
            case DATE:
                Date vDate = ((DateObjectInspector)inspector).getPrimitiveJavaObject(value);
                this.recordConsumer.addInteger(DateWritable.dateToDays(vDate));
                break;
            default:
                throw new IllegalArgumentException("Unsupported primitive data type: " + inspector.getPrimitiveCategory());
            }
        }
    }

parquet.io.MessageColumnIO.MessageColumnIORecordConsumer

    public void startField(String field, int index) {
        try {
            if (MessageColumnIO.DEBUG) {
                this.log("startField(" + field + ", " + index + ")");
            }

            this.currentColumnIO = ((GroupColumnIO)this.currentColumnIO).getChild(index);
            this.emptyField = true;
            if (MessageColumnIO.DEBUG) {
                this.printState();
            }
        } catch (RuntimeException var4) {
            throw new ParquetEncodingException("error starting field " + field + " at " + index, var4);
        }
    }

    public void endField(String field, int index) {
        if (MessageColumnIO.DEBUG) {
            this.log("endField(" + field + ", " + index + ")");
        }

        this.currentColumnIO = this.currentColumnIO.getParent();
        if (this.emptyField) {
            throw new ParquetEncodingException("empty fields are illegal, the field should be ommited completely instead");
        } else {
            this.fieldsWritten[this.currentLevel].markWritten(index);
            this.r[this.currentLevel] = this.currentLevel == 0 ? 0 : this.r[this.currentLevel - 1];
            if (MessageColumnIO.DEBUG) {
                this.printState();
            }
        }
    }

    public void addInteger(int value) {
        if (MessageColumnIO.DEBUG) {
            this.log("addInt(" + value + ")");
        }

        this.emptyField = false;
        this.getColumnWriter().write(value, this.r[this.currentLevel], this.currentColumnIO.getDefinitionLevel());
        this.setRepetitionLevel();
        if (MessageColumnIO.DEBUG) {
            this.printState();
        }
    }

The key lines in DataWritableWriter that lead to the error are these:

    Object keyElement = keyValue.getKey();
    this.recordConsumer.startField(keyName, 0);
    this.writeValue(keyElement, keyInspector, keyType);
    this.recordConsumer.endField(keyName, 0);

The code flow is as follows:

DataWritableWriter.writeMap

MessageColumnIORecordConsumer.startField (for the repeated field)

Note: this.emptyField = true;

Iterate over the entries

Handle the key

Object keyElement = keyValue.getKey();

MessageColumnIORecordConsumer.startField (for the key)

DataWritableWriter.writeValue

type.isPrimitive()

DataWritableWriter.writePrimitive

1) if (value == null), or the category is VOID

Note: this.emptyField is still true

2) if (value != null): MessageColumnIORecordConsumer.addInteger (for example)

Note: this.emptyField = false;

MessageColumnIORecordConsumer.endField (for the key)

MessageColumnIORecordConsumer.endField (for the repeated field, after the loop)

Note: if (this.emptyField) {throw new ParquetEncodingException("empty fields are illegal, the field should be ommited completely instead");}

This error is triggered when an empty collection is inserted into a column of type map<?,?> or array<?>, or when a map contains an entry whose key is null.
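The flag handling traced above can be reproduced with a small stand-alone model. This is only a sketch: ToyRecordConsumer and EmptyFieldDemo are hypothetical names, the model keeps nothing but the emptyField flag (no groups, repetition levels or column writers), and it assumes primitive string keys and values.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class EmptyFieldDemo {

    /** Hypothetical stand-in for MessageColumnIORecordConsumer; models only the flag. */
    static class ToyRecordConsumer {
        private boolean emptyField;

        void startField(String field) {
            emptyField = true;               // every field starts out "empty"
        }

        void addValue(Object value) {
            emptyField = false;              // any add* call clears the flag
        }

        void endField(String field) {
            if (emptyField) {
                throw new IllegalStateException(
                    "empty fields are illegal, the field should be ommited completely instead");
            }
        }
    }

    /** Follows the control flow of DataWritableWriter.writeMap shown above. */
    static void writeMap(ToyRecordConsumer consumer, Map<String, String> map) {
        consumer.startField("map");
        for (Map.Entry<String, String> entry : map.entrySet()) {
            consumer.startField("key");
            if (entry.getKey() != null) {    // writePrimitive adds nothing for null
                consumer.addValue(entry.getKey());
            }
            consumer.endField("key");        // throws when the key was null
            if (entry.getValue() != null) {  // a null value is omitted, which is legal
                consumer.startField("value");
                consumer.addValue(entry.getValue());
                consumer.endField("value");
            }
        }
        consumer.endField("map");            // throws when the map had no entries
    }

    /** Returns true if the toy writer accepts the map, false if it rejects it. */
    public static boolean canWrite(Map<String, String> map) {
        try {
            writeMap(new ToyRecordConsumer(), map);
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Map<String, String> nullKey = new HashMap<>();
        nullKey.put(null, "1");
        Map<String, String> nullValue = new HashMap<>();
        nullValue.put("a", null);

        System.out.println("non-empty map: " + canWrite(Collections.singletonMap("a", "1")));
        System.out.println("empty map:     " + canWrite(Collections.<String, String>emptyMap()));
        System.out.println("null key:      " + canWrite(nullKey));
        System.out.println("null value:    " + canWrite(nullValue));
    }
}
```

In this model only the empty map and the null key are rejected. A null value passes, because writeMap never opens the value field for it.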

I later found that this has already been discussed upstream: https://issues.apache.org/jira/browse/HIVE-11625

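There are two ways to avoid this problem:

1 Run the SQL with Hive instead;

2 Add a UDF, filter_map, that returns null when the map is empty and, when the map is not empty, filters out every entry whose key is null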

spark.udf.register("filter_map", ((map : Map[String, String]) => {if (map != null && !map.isEmpty) map.filter(_._1 != null) else null}))
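For readers who want to check the rule outside Spark, the same filtering can be sketched in plain Java. FilterMapSketch and filterMap are hypothetical names used only for illustration. The sketch also returns null when every key is null, because the filtered result would then be empty and, by the analysis above, would hit the same check. That extra step is an assumption added here and is not part of the original UDF.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class FilterMapSketch {

    /**
     * Null or empty maps become null; otherwise entries with a null key are
     * dropped. If nothing is left after filtering, null is returned as well
     * (an added assumption, see the text above).
     */
    public static Map<String, String> filterMap(Map<String, String> map) {
        if (map == null || map.isEmpty()) {
            return null;
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : map.entrySet()) {
            if (entry.getKey() != null) {
                result.put(entry.getKey(), entry.getValue());
            }
        }
        return result.isEmpty() ? null : result;
    }

    public static void main(String[] args) {
        Map<String, String> mixed = new HashMap<>();
        mixed.put(null, "dropped");
        mixed.put("a", "kept");

        System.out.println(filterMap(mixed));
        System.out.println(filterMap(new HashMap<String, String>()));
    }
}
```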
