Quickly Removing the UTF-8 BOM

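Sooner or later you run into the UTF-8 BOM at work (just "BOM" from here on). When a third-party tool doesn't support it, you have to strip it yourself. For example, SQL files exported from Alibaba Cloud start with a BOM, which Navicat doesn't support, so the BOM has to go.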

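The test file used below is a 265 MB SQL file exported from Alibaba Cloud. It was already cached during testing (the File system inputs value reported by time is close to 0).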

Removing the BOM with sed

sed -e '1s/^\xef\xbb\xbf//' file

Use time to see how long the sed approach takes:

$ /usr/bin/time -v sed -e '1s/^\xef\xbb\xbf//' sqlResult_1601835.sql > /dev/null
        ...
        User time (seconds): 0.33
        System time (seconds): 0.11
        Percent of CPU this job got: 98%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.46
        ...

User time is relatively high because sed processes every line, even though only the first line can contain a BOM, so most of that CPU time is wasted.

sed also supports in-place editing (-i):

$ /usr/bin/time -v sed -e '1s/^\xef\xbb\xbf//' sqlResult_1601835.sql -i
        ...
        User time (seconds): 1.31
        System time (seconds): 3.89
        Percent of CPU this job got: 71%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:07.32
        ...

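This is slower because the output is actually written to a file. strace shows that sed implements in-place editing by writing to a temporary file and then renaming it over the original: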

open("sqlResult_1601835.sql", O_RDONLY) = 3
open("./sedGlXm60", O_RDWR|O_CREAT|O_EXCL, 0600) = 4
...
rename("./sedGlXm60", "sqlResult_1601835.sql")
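
The temp-file-and-rename pattern seen in the strace output can be sketched in Python. This is only an illustration of the pattern, not sed's actual code; the function name strip_bom_via_temp and the "bom" temp-file prefix are made up for this example.

```python
import os
import shutil
import tempfile

BOM = b"\xef\xbb\xbf"

def strip_bom_via_temp(path):
    """Copy path to a temp file without its leading BOM, then rename over it.

    Returns True if a BOM was removed, False if the file had none.
    """
    directory = os.path.dirname(os.path.abspath(path))
    with open(path, "rb") as src:
        if src.read(len(BOM)) != BOM:
            return False  # no BOM: leave the file untouched
        # Create the temp file next to the original so the rename
        # stays on one filesystem.
        fd, tmp_path = tempfile.mkstemp(prefix="bom", dir=directory)
        try:
            with os.fdopen(fd, "wb") as dst:
                # src is already positioned just past the BOM.
                shutil.copyfileobj(src, dst)
            shutil.copymode(path, tmp_path)  # mkstemp creates the file as 0600
            os.replace(tmp_path, path)
        except BaseException:
            os.unlink(tmp_path)
            raise
    return True
```

The whole file is rewritten even though only three bytes change, which is where the extra System time and wall-clock time come from.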

Removing the BOM with tail

tail --bytes=+4 file

tail can skip the BOM and then simply copy the rest of the file, which avoids unnecessary CPU work:

$ /usr/bin/time -v tail --bytes=+4 sqlResult_1601835.sql > /dev/null
        ...
        User time (seconds): 0.01
        System time (seconds): 0.12
        Percent of CPU this job got: 96%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.14
        ...

With tail, though, you have to redirect the output to a new file and then replace the old file yourself.
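
One more caveat: tail --bytes=+4 drops the first three bytes unconditionally, so running it on a file that has no BOM removes real content. A minimal Python sketch of the same skip-and-copy idea, with a check added, might look like this (copy_without_bom is a name invented for this example):

```python
import shutil

BOM = b"\xef\xbb\xbf"

def copy_without_bom(src, dst):
    """Copy binary stream src to dst, dropping a leading UTF-8 BOM if present."""
    head = src.read(len(BOM))
    if head != BOM:
        dst.write(head)  # not a BOM: these bytes are real content
    # The rest is a plain bulk copy with no per-line processing.
    shutil.copyfileobj(src, dst)
```

It could be called with a file opened in "rb" mode as src and sys.stdout.buffer as dst to mimic the redirect used above.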

strip-bom

To combine the advantages of sed and tail, I wrote strip-bom, which supports updating files in place.

First, a test with redirection:

$ /usr/bin/time -v php strip-bom.phar sqlResult_1601835.sql > /dev/null
        ...
        User time (seconds): 0.11
        System time (seconds): 0.22
        Percent of CPU this job got: 98%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.35
        ...

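That is only about 20% faster than sed: User time went down but System time went up. The tool copies in a read/write loop, and each iteration costs one read call and one write call, so I added an option (-b) to set the block size of each read. A larger block means fewer iterations and fewer system calls, which makes it about 60% faster than sed: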

$ /usr/bin/time -v php strip-bom.phar -b 16384 sqlResult_1601835.sql > /dev/null
        ...
        User time (seconds): 0.06
        System time (seconds): 0.12
        Percent of CPU this job got: 96%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.19
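
To show why the block size matters, here is a rough Python sketch of a read/write copy loop that counts its read calls. It is not the strip-bom source; copy_fd and its default block size are assumptions made for this example.

```python
import os

def copy_fd(fd_in, fd_out, block_size=16384):
    """Copy fd_in to fd_out in blocks; return the number of read() calls made."""
    reads = 0
    while True:
        chunk = os.read(fd_in, block_size)  # one system call per iteration
        reads += 1
        if not chunk:
            break  # end of file
        view = memoryview(chunk)
        while len(view) > 0:
            # write() may accept fewer bytes than offered, so loop until done.
            written = os.write(fd_out, view)
            view = view[written:]
    return reads
```

For a regular file, the number of iterations is roughly the file size divided by block_size, so raising the block size cuts the read and write calls proportionally.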

In-place update test, about 30% faster than sed:

$ /usr/bin/time -v php strip-bom.phar -i -b 16384 sqlResult_1601835.sql
        User time (seconds): 0.23
        System time (seconds): 0.67
        Percent of CPU this job got: 17%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:05.11
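
The article does not show how strip-bom implements -i. One way to update a file truly in place, without a temporary file, is to shift the contents left by three bytes and truncate the tail. The sketch below assumes that approach and may differ from what strip-bom actually does; strip_bom_in_place is a hypothetical name.

```python
import os

BOM = b"\xef\xbb\xbf"

def strip_bom_in_place(path, block_size=16384):
    """Shift the file contents left over the BOM, then truncate the tail.

    Returns True if a BOM was removed, False if the file had none.
    """
    fd = os.open(path, os.O_RDWR)
    try:
        if os.pread(fd, len(BOM), 0) != BOM:
            return False
        read_pos = len(BOM)
        write_pos = 0
        while True:
            chunk = os.pread(fd, block_size, read_pos)
            if not chunk:
                break
            # The chunk is already in memory, so writing it three bytes
            # earlier cannot clobber data that has not been read yet.
            done = 0
            while done < len(chunk):
                done += os.pwrite(fd, chunk[done:], write_pos + done)
            read_pos += len(chunk)
            write_pos += len(chunk)
        os.ftruncate(fd, write_pos)  # drop the last three stale bytes
        return True
    finally:
        os.close(fd)
```

The trade-off is safety: unlike the temp-file-and-rename approach, an interruption halfway through leaves the file partially shifted.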

copy_file_range

Linux 4.5 added a new system call:

ssize_t copy_file_range(int fd_in, loff_t *off_in,
                               int fd_out, loff_t *off_out,
                               size_t len, unsigned int flags);

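It copies data directly between two file descriptors, usually with a single system call. So, following sed's approach, you can copy to a temporary file and then replace the old file. The implementation is in: Gist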

Test:

$ /usr/bin/time -v ./copy_file_range sqlResult_1601835.sql
        ...
        User time (seconds): 0.00
        System time (seconds): 2.47
        Percent of CPU this job got: 37%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:06.52

Even with fewer system calls it is only slightly faster than sed; copying to a temporary file is still slower than strip-bom's in-place update.
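
The Gist itself is not reproduced here. As a rough stand-in, the same idea can be sketched with Python's os.copy_file_range wrapper (Python 3.8+, Linux only). The function name strip_bom_copy_file_range is hypothetical, and this is not the author's C code.

```python
import os
import tempfile

BOM = b"\xef\xbb\xbf"

def strip_bom_copy_file_range(path):
    """Copy everything after the BOM into a temp file in-kernel, then rename.

    Returns True if a BOM was removed, False if the file had none.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd_in = os.open(path, os.O_RDONLY)
    try:
        if os.pread(fd_in, len(BOM), 0) != BOM:
            return False
        remaining = os.fstat(fd_in).st_size - len(BOM)
        fd_out, tmp_path = tempfile.mkstemp(dir=directory)
        try:
            offset = len(BOM)
            while remaining > 0:
                # The kernel may copy fewer bytes than requested, so loop.
                copied = os.copy_file_range(fd_in, fd_out, remaining,
                                            offset_src=offset)
                if copied == 0:
                    break
                offset += copied
                remaining -= copied
            os.replace(tmp_path, path)
        except BaseException:
            os.unlink(tmp_path)
            raise
        finally:
            os.close(fd_out)
    finally:
        os.close(fd_in)
    return True
```

The data never passes through user space, which matches the near-zero User time above, but the kernel still has to write a full copy of the file.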

Removing the BOM with dos2unix

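I had always assumed dos2unix only converted CRLF line endings. After reading Feng_Yu's comment I checked the man page and found that dos2unix does a lot more, including an option to remove the BOM (-r):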

$ /usr/bin/time -v dos2unix -r sqlResult_1601835.sql
dos2unix: 正在转换文件 sqlResult_1601835.sql 为Unix格式...
        Command being timed: "dos2unix -r sqlResult_1601835.sql"
        User time (seconds): 10.01
        System time (seconds): 0.90
        Percent of CPU this job got: 60%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:18.20

dos2unix works much like sed: it writes to a temporary file and then replaces the original, and it also processes every line, so its performance is poor.

Source: https://segmentfault.com/a/1190000012621180
