Notes on Pitfalls at Work: 1. The BytesWritable Misconception in Hadoop

1. Background

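Recently I helped a colleague from another department with a small request: read all the SequenceFile-serialized files stored in HDFS from 2018 to the present and save them again as plain text so that he could process them further. He mainly works on machine learning and is not familiar with Hadoop or Spark, so I figured such a small request only needed light support and spent a few minutes writing a script for him. To my surprise, he came back a day later and said that most of the files the script produced were wrong... It turned out my own code had a bug.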

2. Initial version

Reading a SequenceFile from Spark or Hadoop only requires calling the corresponding function.

The first version of the Spark program was:

 package com.ws.test

 import org.apache.hadoop.io.{BytesWritable, Text}
 import org.apache.spark.{SparkConf, SparkContext}

 object Test {
   def main(args: Array[String]): Unit = {
     if (args.length < 1) {
       println("input param error: <method name>")
     } else {
       args(0) match {
         case "deseqData" => deseqData(args)
         case _ =>
       }
     }
   }

   def deseqData(args: Array[String]): Unit = {
     if (args.length != 3) {
       println("input param error: <method name> <input dir> <output dir>")
       return
     }
     val conf = new SparkConf()
     conf.setAppName(args(0))
     val sc = new SparkContext(conf)
     val inputDir = args(1)
     val outputDir = args(2)
     sc.sequenceFile[Text, BytesWritable](s"hdfs://$inputDir")
       .map(data => new String(data._2.getBytes)).saveAsTextFile(outputDir)
     sc.stop()
   }
 }

The bash script I provided was:

#!/bin/bash
source ~/.bashrc

# Expect exactly two arguments: input path and output path
if [[ $# -ne 2 ]];then
    echo "$(date +'%Y-%m-%d %H:%M:%S')    input param error: <file input path> <file output path>"
    exit 1
fi

SPARK_BIN_DIR=/home/hadoop/spark/spark-1.5.-bin-hadoop2./bin
HADOOP_BIN_PATH=/home/hadoop/hadoop-2.3.-cdh5.0.0/bin

# Remove any previous output directory so saveAsTextFile does not fail
$HADOOP_BIN_PATH/hadoop fs -rm -r $2
if [ $? -ne 0 ];then
    echo "$(date +'%Y-%m-%d %H:%M:%S')    output file dir does not exist in hdfs"
fi

runJob(){
    echo "$(date +'%Y-%m-%d %H:%M:%S')    spark task begins!"
    nohup $SPARK_BIN_DIR/spark-submit --class com.ws.test.Test  --master yarn --num-executors  --driver-memory 7192M --executor-memory 7192M --queue default run.jar deseqData $1 $2 >> log 2>&1 &
    if [ $? -ne 0 ];then
        echo "$(date +'%Y-%m-%d %H:%M:%S')    spark task running error"
        exit 1
    fi
    pid=$!
    echo "$(date +'%Y-%m-%d %H:%M:%S')    spark task processId is $pid, wait to finish..."
    wait $pid
    if [ $? -ne 0 ];then
        echo  "$(date +'%Y-%m-%d %H:%M:%S')    spark task running exception"
        exit 1
    fi
    tail -f log
    echo "$(date +'%Y-%m-%d %H:%M:%S')    spark task  finished!"
}

runJob $1 $2

Running ./run.sh /crawler/data/2018-04-04/0/data1522774524.799569.seq /crawler/wstest starts the parsing job.

The extracted result looked like this:

What on earth are those @ symbols...

3. Improved version

Since the exported data was wrong, I revised the code. The only difference is:

 sc.sequenceFile[Text, BytesWritable](s"hdfs://$inputDir")
   //      .map(data => new String(data._2.getBytes))
   .map(data => {
     val value = data._2
     // Shrink the backing buffer to the valid length before decoding
     value.setCapacity(value.getLength)
     new String(value.getBytes)
   }).saveAsTextFile(outputDir)

After repackaging and rerunning the bash script, the result was as shown in the figure below:

Finally it looked right, and I could breathe a sigh of relief... but why does this work?

4. Root cause analysis

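After byte[] data is stored in a BytesWritable, what BytesWritable.getBytes() returns is not necessarily the original data; it can be considerably longer. BytesWritable grows its internal buffer automatically, so when you store size bytes it may keep them in a buffer of length capacity (capacity >= size). getBytes() returns that whole backing buffer, so only the first getLength() bytes are valid and the bytes at the end are surplus. If the payload is a Protocol Buffers serialized message, it can then no longer be deserialized.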
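To make the symptom concrete, here is a minimal sketch that mimics an over-allocated buffer with a plain array. It is a stand-in, not the Hadoop class itself: the object PaddedBufferDemo and its helpers are hypothetical names, the 3/2 capacity is just an illustrative choice, and UTF-8 is assumed as the text encoding. Decoding the whole buffer drags the padding along as NUL characters, which some editors render as ^@ and which may be what showed up as @ in the output above.

```scala
import java.nio.charset.StandardCharsets.UTF_8

object PaddedBufferDemo {
  // Hypothetical helper: copy the payload into a larger, zero-filled buffer,
  // similar in spirit to what a growable byte buffer may hold internally.
  def padded(payload: Array[Byte], capacity: Int): Array[Byte] = {
    require(capacity >= payload.length, "capacity must cover the payload")
    val buf = new Array[Byte](capacity)
    System.arraycopy(payload, 0, buf, 0, payload.length)
    buf
  }

  // Decodes every byte of the buffer, padding included (the buggy pattern).
  def decodeAll(buf: Array[Byte]): String = new String(buf, UTF_8)

  // Decodes only the first `length` bytes, i.e. the valid payload.
  def decodeValid(buf: Array[Byte], length: Int): String =
    new String(buf, 0, length, UTF_8)

  def main(args: Array[String]): Unit = {
    val payload = "hello".getBytes(UTF_8)
    val buf = padded(payload, payload.length * 3 / 2)
    // The first string is longer than the payload because of trailing NULs.
    println(decodeAll(buf).length)
    println(decodeValid(buf, payload.length))
  }
}
```

With a real BytesWritable, the buffer would come from getBytes() and the valid length from getLength().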

In that case, calling BytesWritable.setCapacity(bytesWritable.getLength()) trims the surplus space at the end.
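Besides setCapacity, the valid bytes can also be read without resizing the buffer. The sketch below assumes hadoop-common is on the classpath, that the Hadoop version provides BytesWritable.copyBytes(), and that the payload is UTF-8 text (the original code relies on the platform default charset instead). The object name BytesWritableDecode and its methods are hypothetical.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import org.apache.hadoop.io.BytesWritable

object BytesWritableDecode {
  // Decode only the first getLength bytes of the backing buffer.
  // Assumes the payload is UTF-8 text.
  def decode(value: BytesWritable): String =
    new String(value.getBytes, 0, value.getLength, UTF_8)

  // Return a copy that is exactly getLength bytes long, for example to
  // hand to a binary deserializer.
  def payload(value: BytesWritable): Array[Byte] = value.copyBytes()

  def main(args: Array[String]): Unit = {
    val bw = new BytesWritable()
    val first = "a much longer first record".getBytes(UTF_8)
    bw.set(first, 0, first.length)
    // Reusing the same object for a shorter record keeps the larger buffer.
    val second = "short".getBytes(UTF_8)
    bw.set(second, 0, second.length)
    println(s"valid length = ${bw.getLength}, buffer length = ${bw.getBytes.length}")
    println(decode(bw))
  }
}
```

In the Spark job, the map step could then read .map(data => BytesWritableDecode.decode(data._2)), which leaves the Writable itself untouched.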