cut-trailing-bytes: a small tool for stripping trailing zero bytes from binary files

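Background

An earlier post, "Binary file handling: padding and stripping trailing zeros", described a way to use sed to strip trailing NULL bytes (0x00 in hex) from a binary file.

I recently found that this approach has a limitation: it cannot handle large files. sed is line-oriented, so to sed a binary file of several hundred MB is effectively a single line of several hundred MB, which exceeds sed's maximum line size.

I did not dig into the exact limits. Some versions seem to hard-code an upper bound, while others seem to be limited by how much memory they can allocate.

Either way, sed was no longer up to the job and I needed another approach.

I still believe an existing tool can do this, but I could not find one in the time I had, so I wrote my own as a stopgap. If you know a simpler way, please let me know.

If you only need the tool, you can skip the rest of this post. The source is here: https://github.com/zqb-all/cut-trailing-bytes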

Approach

The idea is simple: find the last byte in the file that is not 0x00, and truncate the file right after it.
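As a minimal sketch of that idea on an in-memory byte slice (not the code developed below, which works on files), the kept length is the index of the last non-matching byte plus one. The function name `trimmed_len` and the `cut` parameter are hypothetical names chosen for illustration.

```rust
/// Returns the length `data` would have after dropping every trailing `cut` byte.
fn trimmed_len(data: &[u8], cut: u8) -> usize {
    // rposition finds the index of the last byte that differs from `cut`.
    // The kept length is that index plus one, or 0 if every byte matches.
    data.iter().rposition(|&b| b != cut).map_or(0, |i| i + 1)
}

fn main() {
    let data = b"hello\n\0\0\0\0";
    println!("{}", trimmed_len(data, 0)); // prints 6
}
```

Zero bytes in the middle of the data are kept, because only the position of the last non-matching byte matters.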

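Scanning backwards from the end of the file looks more efficient, but scanning forwards from the start is simpler, so let's begin with the simplest implementation.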
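For reference, the backward scan could look roughly like the sketch below. It is not what this post implements. The function name `trimmed_len_from_end`, the 4096-byte chunk size, and the `cut` parameter are assumptions made for illustration, and the function takes any `Read + Seek` so it can be tried on an in-memory `Cursor` as well as a `File`.

```rust
use std::io::{self, Cursor, Read, Seek, SeekFrom};

/// Scans `src` backwards in fixed-size chunks and returns the length that
/// remains once all trailing `cut` bytes are dropped.
fn trimmed_len_from_end<R: Read + Seek>(src: &mut R, cut: u8) -> io::Result<u64> {
    const CHUNK: u64 = 4096;
    let mut buf = [0u8; CHUNK as usize];
    // Bytes in [0, end) have not been examined yet.
    let mut end = src.seek(SeekFrom::End(0))?;
    while end > 0 {
        let start = end.saturating_sub(CHUNK);
        let len = (end - start) as usize;
        src.seek(SeekFrom::Start(start))?;
        src.read_exact(&mut buf[..len])?;
        // The last non-matching byte in this chunk, if any, ends the search.
        if let Some(i) = buf[..len].iter().rposition(|&b| b != cut) {
            return Ok(start + i as u64 + 1);
        }
        end = start;
    }
    Ok(0)
}

fn main() -> io::Result<()> {
    let mut data = b"hello\n".to_vec();
    data.resize(data.len() + 10_000, 0);
    let mut src = Cursor::new(data);
    println!("{}", trimmed_len_from_end(&mut src, 0)?); // prints 6
    Ok(())
}
```

The scan stops at the first chunk that contains a non-matching byte, so a file with a short run of trailing zeros is handled without reading the rest of it.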

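Implementation

This would be easy to write in C, but I had been reading up on Rust syntax over the Spring Festival holiday and could not resist trying it.

A search of the Rust documentation turned up this example at https://doc.rust-lang.org/std/io/trait.Read.html#method.bytes: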

use std::io;
use std::io::prelude::*;
use std::fs::File;

fn main() -> io::Result<()> {
    let mut f = File::open("foo.txt")?;

    for byte in f.bytes() {
        println!("{}", byte.unwrap());
    }
    Ok(())
}

I also found set_len, which truncates a file directly.

With those two pieces, a first version came together quickly by imitating the example.

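use std::fs::{File, OpenOptions};
use std::io;
use std::io::prelude::*;

fn main() -> io::Result<()> {
    let f = File::open("foo.txt")?;
    // Length of the file up to and including the last non-zero byte.
    let mut total_len: u64 = 0;
    // Number of zero bytes seen since the last non-zero byte.
    let mut tmp_len: u64 = 0;

    for byte in f.bytes() {
        match byte? {
            0 => { tmp_len += 1; }
            _ => {
                // The pending zeros were not trailing after all: count them,
                // then count the non-zero byte itself.
                total_len += tmp_len;
                tmp_len = 0;
                total_len += 1;
            }
        }
    }
    println!("total_len:{},tmp_len:{}", total_len, tmp_len);

    // Reopen the file for writing and truncate away the trailing zeros.
    let f = OpenOptions::new().write(true).open("foo.txt")?;
    f.set_len(total_len)?;
    Ok(())
}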

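A quick test on a small file worked, so on to a large one.

I went straight to a 500 MB file, and the program appeared to hang. Evidently f.bytes() on a bare File is like running dd with bs=1 in a terminal: each byte is read separately, which is extremely inefficient.

I meant to wait and see how long it would take, but it was still stuck when I came back from a shower, so I gave up and started on the fix.

The change is to read into a buffer first, then check the buffer byte by byte.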

    ...
    let mut buffer = [0; 4096];
    loop {
        let n = f.read(&mut buffer[..])?;
        if n == 0 { break; }
        for byte in buffer.bytes() {
            match byte.unwrap() {
    ...

That made a huge difference: a 500 MB file now finishes in a few seconds under WSL on my Windows 10 machine.
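For comparison, another way to avoid one read per byte is to keep the bytes() loop and wrap the reader in std::io::BufReader, which batches the underlying reads. This is a sketch rather than the version used in this post. The function name `trimmed_len_buffered` and the `cut` parameter are hypothetical, and the function accepts any `Read` so it can be tried on an in-memory `Cursor`.

```rust
use std::io::{self, BufReader, Cursor, Read};

/// Counts the bytes to keep, reading through a BufReader so that each
/// call to the underlying reader fetches a whole block instead of one byte.
fn trimmed_len_buffered<R: Read>(src: R, cut: u8) -> io::Result<u64> {
    let mut kept = 0u64; // length up to the last non-matching byte
    let mut pending = 0u64; // matching bytes seen since then
    for byte in BufReader::new(src).bytes() {
        if byte? == cut {
            pending += 1;
        } else {
            // The pending run was not trailing, so it is kept as well.
            kept += pending + 1;
            pending = 0;
        }
    }
    Ok(kept)
}

fn main() -> io::Result<()> {
    let src = Cursor::new(b"hello\n\0\0\0\0".to_vec());
    println!("{}", trimmed_len_buffered(src, 0)?); // prints 6
    Ok(())
}
```

With a real file, the same function could be called as `trimmed_len_buffered(File::open(path)?, 0)`.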

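Two more changes: replace the hard-coded foo.txt with a file name taken from the command-line arguments, and decrement n inside the buffer loop, breaking when it reaches 0, so that a final read that does not fill the buffer is handled correctly.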

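use std::env;
use std::fs::{File, OpenOptions};
use std::io;
use std::io::prelude::*;

fn main() -> io::Result<()> {
    // The file to trim is the first command-line argument.
    let args: Vec<String> = env::args().collect();
    let filename = &args[1];

    let mut f = File::open(filename)?;
    // Length of the file up to and including the last non-zero byte.
    let mut total_len: u64 = 0;
    // Number of zero bytes seen since the last non-zero byte.
    let mut tmp_len: u64 = 0;
    let mut buffer = [0; 4096];

    loop {
        // n is the number of bytes actually read; the last read may not fill the buffer.
        let mut n = f.read(&mut buffer[..])?;
        if n == 0 { break; }
        for byte in buffer.bytes() {
            match byte.unwrap() {
                0 => { tmp_len += 1; }
                _ => {
                    total_len += tmp_len;
                    tmp_len = 0;
                    total_len += 1;
                }
            }
            // Stop after the n valid bytes so stale buffer contents are ignored.
            n -= 1;
            if n == 0 { break; }
        }
    }
    println!("len:{}", total_len);

    // Reopen the file for writing and truncate away the trailing zeros.
    let f = OpenOptions::new().write(true).open(filename)?;
    f.set_len(total_len)?;
    println!("done");
    Ok(())
}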
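The countdown on n can also be avoided by iterating over only the filled part of the buffer, `&buffer[..n]`. The sketch below shows that variant under a few assumptions: the function name `trimmed_len_chunked` and the `cut` parameter are made up for illustration, and the reader is generic so the loop can be tried on an in-memory `Cursor` instead of a file.

```rust
use std::io::{self, Cursor, Read};

/// Reads `src` in 4096-byte chunks and returns the length that remains
/// once all trailing `cut` bytes are dropped.
fn trimmed_len_chunked<R: Read>(src: &mut R, cut: u8) -> io::Result<u64> {
    let mut buffer = [0u8; 4096];
    let mut kept = 0u64; // length up to the last non-matching byte
    let mut pending = 0u64; // matching bytes seen since then
    loop {
        let n = src.read(&mut buffer)?;
        if n == 0 {
            break; // end of input
        }
        // Only the first n bytes are valid; the slice excludes stale data.
        for &byte in &buffer[..n] {
            if byte == cut {
                pending += 1;
            } else {
                kept += pending + 1;
                pending = 0;
            }
        }
    }
    Ok(kept)
}

fn main() -> io::Result<()> {
    let mut data = b"hello\n".to_vec();
    data.resize(data.len() + 5000, 0xff);
    let mut src = Cursor::new(data);
    println!("{}", trimmed_len_chunked(&mut src, 0xff)?); // prints 6
    Ok(())
}
```

Slicing to `n` expresses the partial final read directly, so no separate counter or early break is needed.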

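Final version

The version in the previous section was good enough for my needs at the time, so I did not bother optimizing it further.

Over a weekend I polished it a bit more. It now uses structopt for argument handling and lets you specify the byte value to trim. So it can strip not only trailing 0x00 bytes but also any other value you choose, such as 0xFF.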
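As an illustration of what accepting a hex byte on the command line involves, the sketch below parses a string such as "ff" into a u8 using only the standard library. It is not taken from the tool's source. The function name `parse_hex_byte` is hypothetical, and accepting an optional "0x" prefix is an assumption of this sketch, not a documented feature of the tool.

```rust
use std::num::ParseIntError;

/// Parses a byte written in hex, such as "ff" or "0", into a u8.
/// The optional "0x"/"0X" prefix is an assumption of this sketch.
fn parse_hex_byte(text: &str) -> Result<u8, ParseIntError> {
    let digits = text
        .strip_prefix("0x")
        .or_else(|| text.strip_prefix("0X"))
        .unwrap_or(text);
    // from_str_radix rejects non-hex digits and values above 0xff.
    u8::from_str_radix(digits, 16)
}

fn main() {
    println!("{:?}", parse_hex_byte("ff")); // prints Ok(255)
}
```

Values that do not fit in a single byte, such as "100", are reported as errors instead of being silently truncated.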

I won't paste the source here. If you need it, it is on GitHub: https://github.com/zqb-all/cut-trailing-bytes

With proper argument handling, the help output looks much nicer.

$ cut-trailing-bytes --help
cut-trailing-bytes 0.1.0
A tool for cut trailing bytes, default cut trailing NULL bytes(0x00 in hex)

USAGE:
    cut-trailing-bytes [FLAGS] [OPTIONS] <file>

FLAGS:
    -d, --dry-run    Check the file but don't real cut it
    -h, --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
    -c, --cut-byte <byte-in-hex>    For example, pass 'ff' if want to cut 0xff [default: 0]

ARGS:
    <file>    File to cut

Here is how it works in practice.

$ echo "hello" > hello_00
$ dd if=/dev/zero bs=1k count=1 >> hello_00
1+0 records in
1+0 records out
1024 bytes (1.0 kB, 1.0 KiB) copied, 0.0031857 s, 321 kB/s
$ hexdump -C hello_00
00000000  68 65 6c 6c 6f 0a 00 00  00 00 00 00 00 00 00 00  |hello...........|
00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000406
$ du -b hello_00
1030    hello_00
$ ./cut-trailing-bytes hello_00
cut hello_00 from 1030 to 6
$ hexdump -C hello_00
00000000  68 65 6c 6c 6f 0a                                 |hello.|
00000006
$ du -b hello_00
6       hello_00

$ echo "hello" > hello_ff
$ dd if=/dev/zero bs=1k count=1 | tr '\000' '\377' >> hello_ff
1+0 records in
1+0 records out
1024 bytes (1.0 kB, 1.0 KiB) copied, 0.0070723 s, 145 kB/s
$ hexdump -C hello_ff
00000000  68 65 6c 6c 6f 0a ff ff  ff ff ff ff ff ff ff ff  |hello...........|
00000010  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
*
00000406
$ du -b hello_ff
1030    hello_ff
$ ./cut-trailing-bytes hello_ff -c ff
cut hello_ff from 1030 to 6
$ hexdump -C hello_ff
00000000  68 65 6c 6c 6f 0a                                 |hello.|
00000006
$ du -b hello_ff
6       hello_ff

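Aside

Executables built by Rust are fairly large. I later found that building with the nightly toolchain makes them much smaller, and running strip afterwards shrinks them further.

In the end, the stripped release build of the version listed above is a little over 200 KB, while the GitHub version with argument handling and other improvements comes to over 700 KB.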

WeChat official account: https://sourl.cn/qVmBKh

Blog: https://www.cnblogs.com/zqb-all/p/12641329.html