Problem code
- #include <arm_neon.h> // float32_t, float32x4_t and the NEON intrinsics
- float32_t Sum_float(float32_t *data, const int count)
- {
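- float32x4_t res = vdupq_n_f32(0.0f); // four-lane accumulator
- // Main loop: 16 floats (four q-registers) per iteration; count & ~15 rounds count down to a multiple of 16
- for(int i = 0; i <(count & (~15)); i += 16)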
- {
- #if 01
- float32x4x4_t v0 = vld1q_f32_x4(data + i);
- float32x4_t v00 = v0.val[0];
- float32x4_t v01 = v0.val[1];
- float32x4_t v02 = v0.val[2];
- float32x4_t v03 = v0.val[3];
- #else
- float32x4_t v00 = vld1q_f32(data + i);
- float32x4_t v01 = vld1q_f32(data + i + 4);
- float32x4_t v02 = vld1q_f32(data + i + 8);
- float32x4_t v03 = vld1q_f32(data + i + 12);
- #endif
- v00 = vaddq_f32(v00, v02);
- v01 = vaddq_f32(v01, v03);
- res = vaddq_f32(res, vaddq_f32(v00, v01));
- }
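- // Horizontal reduction: {r0+r2, r1+r3}, then add the two remaining lanes
- float32x2_t res1 = vadd_f32(vget_low_f32(res), vget_high_f32(res));
- float32_t v0[2];
- vst1_f32(v0, res1);
- v0[0] += v0[1];
- // Scalar tail: the last count % 16 elements
- for(int i = count & (~15); i < count; ++i){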
- v0[0] += data[i];
- }
- return v0[0];
- }
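To check results or reason about rounding without ARM hardware, the sketch below models Sum_float's summation order in plain C++. The name sum_float_model is hypothetical; each lanes[k] stands for lane k of the NEON accumulator res, and count & ~15 marks where the 16-wide main loop ends and the scalar tail begins. Under ordinary IEEE single-precision arithmetic it should add the same values in the same order as the NEON code, assuming no flush-to-zero or fused-operation differences.

```cpp
#include <cassert>

// Hypothetical scalar model of Sum_float's summation order.
// lanes[k] plays the role of lane k of the NEON accumulator `res`.
float sum_float_model(const float* data, int count) {
    assert(count >= 0);
    float lanes[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    const int main_end = count & ~15;  // largest multiple of 16 <= count
    for (int i = 0; i < main_end; i += 16) {
        for (int k = 0; k < 4; ++k) {
            float v00 = data[i + k] + data[i + 8 + k];      // v00 = v00 + v02
            float v01 = data[i + 4 + k] + data[i + 12 + k]; // v01 = v01 + v03
            lanes[k] += v00 + v01;                          // res += v00 + v01
        }
    }
    // vadd_f32(low, high) gives {l0 + l2, l1 + l3}; then v0[0] += v0[1]
    float total = (lanes[0] + lanes[2]) + (lanes[1] + lanes[3]);
    for (int i = main_end; i < count; ++i) {
        total += data[i];  // scalar tail
    }
    return total;
}
```

With count = 35, for example, the main loop covers elements 0 to 31 and the scalar tail adds the remaining 3.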
Compilation tests
First, I checked the documentation: vld1q_f32_x4 is listed as supported on v7/A32/A64.
Results across NDK versions: every version compiles the code in the #else branch; for the #if 01 branch, the results are:
NDK | armeabi-v7a, -O1 | armeabi-v7a, -O0 | arm64-v8a
---|---|---|---
r20c | clang++: error: clang frontend command failed due to signal (use -v to see invocation) | ok | ok
r19c | ok | ok | ok
r15c | error: use of undeclared identifier 'vld1q_f32_x4' | error: use of undeclared identifier 'vld1q_f32_x4' | ok
The problem is not limited to vld1q_f32_x4: similar intrinsics such as vld1_u8_x2 and vst1q_f32_x4 behave the same way.
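Since the #else branch compiles with every NDK version tested, one workaround is to wrap the plain loads and stores in helpers that use the same float32x4x4_t structure as vld1q_f32_x4 and vst1q_f32_x4. This is a minimal sketch: load_f32_x4 and store_f32_x4 are hypothetical names, and the non-NEON branch is only a scalar stand-in so the sketch can be built and tested on a host machine.

```cpp
#include <cassert>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
typedef float32x4x4_t f32x4x4;  // the real NEON structure type

// Same layout as vld1q_f32_x4(p), built from four plain vld1q_f32 loads.
static inline f32x4x4 load_f32_x4(const float* p) {
    assert(p != nullptr);
    f32x4x4 r;
    r.val[0] = vld1q_f32(p);
    r.val[1] = vld1q_f32(p + 4);
    r.val[2] = vld1q_f32(p + 8);
    r.val[3] = vld1q_f32(p + 12);
    return r;
}

// Same effect as vst1q_f32_x4(p, v), built from four plain vst1q_f32 stores.
static inline void store_f32_x4(float* p, const f32x4x4& v) {
    vst1q_f32(p, v.val[0]);
    vst1q_f32(p + 4, v.val[1]);
    vst1q_f32(p + 8, v.val[2]);
    vst1q_f32(p + 12, v.val[3]);
}
#else
// Scalar stand-in for non-ARM hosts only, so the sketch compiles and can be tested.
struct f32x4 { float lane[4]; };
struct f32x4x4 { f32x4 val[4]; };

static inline f32x4x4 load_f32_x4(const float* p) {
    assert(p != nullptr);
    f32x4x4 r;
    for (int v = 0; v < 4; ++v)
        for (int l = 0; l < 4; ++l) r.val[v].lane[l] = p[4 * v + l];
    return r;
}

static inline void store_f32_x4(float* p, const f32x4x4& v) {
    for (int vi = 0; vi < 4; ++vi)
        for (int l = 0; l < 4; ++l) p[4 * vi + l] = v.val[vi].lane[l];
}
#endif
```

The same pattern covers vld1_u8_x2: two vld1_u8 loads filling val[0] and val[1] of a uint8x8x2_t.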
Performance comparison
Test code:
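- #include <arm_neon.h>
- #include <cstdio>
- #include <cstdlib>
- #include <ctime>
- // Sum_float as defined above
- int main()
- {
- const size_t len = 1024*1024 * 16;
- float32_t *data = new float32_t[len];
- for(size_t i = 0; i < len; ++i) {
- data[i] = std::rand() / 100.0;
- }
- clock_t t0 = std::clock(); // measures CPU time, not wall-clock time
- float32_t sum = Sum_float(data, static_cast<int>(len));
- printf("sum=%f , time cost=%f \n", sum, 1000.0 * (double)(std::clock() - t0) / CLOCKS_PER_SEC);
- delete[] data;
- return 0;
- }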
I tested arm64-v8a builds from all three NDK versions, plus an armeabi-v7a build from r19c, each with both the #if and the #else branch. Every run took about 3.55 ms, with no noticeable difference.
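A single std::clock() reading measures CPU time for one run, and at around 3.55 ms per run it can be affected by noise such as cold caches or CPU frequency changes. A sketch of a slightly more robust measurement, assuming C++11: repeat the call and keep the fastest wall-clock time from std::chrono::steady_clock. best_time_ms is a hypothetical helper, not part of the original test.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <limits>

// Runs fn() `reps` times and returns the fastest wall-clock duration in milliseconds.
template <typename Fn>
double best_time_ms(Fn&& fn, int reps) {
    assert(reps > 0);
    double best = std::numeric_limits<double>::infinity();
    for (int r = 0; r < reps; ++r) {
        const auto t0 = std::chrono::steady_clock::now();
        fn();
        const auto t1 = std::chrono::steady_clock::now();
        const double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        best = std::min(best, ms);
    }
    return best;
}
```

In the test program, one might call `best_time_ms([&] { sum = Sum_float(data, static_cast<int>(len)); }, 10)` and print sum afterwards, so the compiler cannot discard the work being timed.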
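Related issues:
Address alignment
Although armeabi-v7a builds succeed with r19c (as they do with r20c without optimization), the program crashed at run time. The cause was an unaligned address passed to a vldN(q)_type_xN intrinsic.
For arm64-v8a, I printed every address passed to vldN(q)_type_xN and also found addresses such as 0x7350800001, with the last hex digit taking every value from 0 to E, yet nothing failed. So does this intrinsic impose an alignment requirement only on armeabi-v7a, and not on arm64-v8a?
By contrast, the regular vldN(q)_type intrinsics have no such alignment requirement, so it is best to avoid vldN(q)_type_xN.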
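When tracking down a crash like this one, it can help to check a pointer's alignment before a suspect load and, if a fast path must stay, to handle a few leading elements with scalar code until the pointer is aligned (often called loop peeling). A minimal sketch: is_aligned and peel_count are hypothetical helpers, and the 16-byte alignment used in the test is an assumption, not a documented requirement of the _xN intrinsics.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True if p is a multiple of `alignment` bytes; alignment must be a power of two.
inline bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Number of leading elements to process with scalar code so that p + n is
// `alignment`-byte aligned, capped at count. Returns count when whole elements
// can never reach that alignment (e.g. a float* at an odd address).
template <typename T>
std::size_t peel_count(const T* p, std::size_t count, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    const std::uintptr_t misalign =
        reinterpret_cast<std::uintptr_t>(p) & (alignment - 1);
    if (misalign == 0) return 0;
    const std::uintptr_t gap = alignment - misalign;  // bytes to the next boundary
    if (gap % sizeof(T) != 0) return count;           // unreachable: stay scalar
    const std::size_t n = static_cast<std::size_t>(gap / sizeof(T));
    return n < count ? n : count;
}
```

For instance, a uint8_t pointer one byte past a 16-byte boundary needs 15 scalar iterations before it is aligned.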
Crash log caused by the unaligned address in the code:
- libc : Fatal signal 7 (SIGBUS), code 1 (BUS_ADRALN), fault addr 0xf0900001 in tid 27659 (ClarityOpt), pid 27659 (ClarityOpt)
- crash_dump32: obtaining output fd from tombstoned, type: kDebuggerdTombstone
- crash_dump32: performing dump of process 27659 (target tid = 27659)
- DEBUG : Process name is /data/local/tmp/ClarityOpt, not key_process
- DEBUG : *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
- DEBUG : Build fingerprint: 'OPPO/PCCM00/OP4A7A:10/QKQ1.191222.002/1584699103:user/release-keys'
- DEBUG : Revision: '0'
- DEBUG : ABI: 'arm'
- DEBUG : Timestamp: 2020-05-09 15:15:16+0800
- DEBUG : pid: 27659, tid: 27659, name: ClarityOpt>>> /data/local/tmp/ClarityOpt <<<
- DEBUG : uid: 0
- crash_dump32: type=1400 audit(0.0:27044): avc: denied { read } for name="ClarityOpt" dev="sda11" ino=30524 scontext=u:r:crash_dump:s0 tcontext=u:object_r:shell_data_file:s0 tclass=file permissive=1
- crash_dump32: type=1400 audit(0.0:27045): avc: denied { open } for path="/data/local/tmp/ClarityOpt" dev="sda11" ino=30524 scontext=u:r:crash_dump:s0 tcontext=u:object_r:shell_data_file:s0 tclass=file permissive=1
- crash_dump32: type=1400 audit(0.0:27046): avc: denied { getattr } for path="/data/local/tmp/ClarityOpt" dev="sda11" ino=30524 scontext=u:r:crash_dump:s0 tcontext=u:object_r:shell_data_file:s0 tclass=file permissive=1
- crash_dump32: type=1400 audit(0.0:27047): avc: denied { map } for path="/data/local/tmp/ClarityOpt" dev="sda11" ino=30524 scontext=u:r:crash_dump:s0 tcontext=u:object_r:shell_data_file:s0 tclass=file permissive=1
- DEBUG : signal 7 (SIGBUS), code 1 (BUS_ADRALN), fault addr 0xf0900001
- DEBUG : r0 00000043 r1 00000000 r2 a9a5ac6f r3 00000003
- DEBUG : r4 f0900001 r5 ffcb0a00 r6 ffcb0a40 r7 ffcb0b60
- DEBUG : r8 f0900007 r9 00000001 r10 f0900000 r11 f0900000
- DEBUG : ip ffcb0500 sp ffcb09f0 lr 00000004 pc 021d265e
- DEBUG :
- DEBUG : backtrace:
- DEBUG : #00 pc 0000365e /data/local/tmp/ClarityOpt (BuildId: fb1d8b990741386becb60ff1c8b10583efb05f70)
- DEBUG : #01 pc 00004271 /data/local/tmp/ClarityOpt (BuildId: fb1d8b990741386becb60ff1c8b10583efb05f70)
- DEBUG : #02 pc 00004c9f /data/local/tmp/ClarityOpt (BuildId: fb1d8b990741386becb60ff1c8b10583efb05f70)
- DEBUG : #03 pc 00004dd3 /data/local/tmp/ClarityOpt (BuildId: fb1d8b990741386becb60ff1c8b10583efb05f70)
- DEBUG : #04 pc 000513bb /apex/com.android.runtime/lib/bionic/libc.so (__libc_init+66) (BuildId: 8e41d0dce7911ae25a51deb63aa9720c)
- DEBUG : #05 pc 00002a98 /data/local/tmp/ClarityOpt (BuildId: fb1d8b990741386becb60ff1c8b10583efb05f70)
Source: https://www.cnblogs.com/willhua/p/12858725.html