前言
写一手漂亮的代码, 何谓漂亮的代码? 对我来说大概有这么几点:
写法符合规范(如: 该空格的地方打上空格, 该换行的地方换行, 名命方式符合规范等等)
简洁且可读性高(能十行代码实现并且让人容易看懂的绝不写十一行, 对经常重复出现的代码段落进行封装)
性能高(如: 运行时间尽可能短, 运行时所用内存尽可能少)
要实现以上目标, 自然就要对代码进行优化, 说到代码的优化, 自然而然就会想到对算法时间复杂度进行优化, 比如我要实现一个在有序数组中查找一个数, 最容易想到的就是遍历一遍 O(n) 的复杂度, 优化一下自然是使用二分, O(logn) 的复杂度. 如果这段代码在我们的程序中会经常被调用, 那么, 通过这算法上的优化, 我们的程序性能自然而然的会有很高的提升.
但是, 有时候会发现, 已经对算法进行优化了, 程序的性能 (如运行时间, 内存占用等) 仍然不能达到预期, 那么, 这时候该如何对我们的代码进行进一步的优化呢?
这篇文章将以 Python 为例进行介绍
先来段代码
这里, 我将通过使用 Julia 分形的代码来进行.
Julia 集合, 由式 \(f_c(z) = z ^2 + c\) 进行反复迭代到.
对于固定的复数 c , 取某一 z 值, 可以得到序列
\(z_0, f_c(z_0), f_c(f_c(z_0)), ...\)
这一序列可能发散于无穷大或处于某一范围之内并收敛于某一值, 我们将使其不扩散的 z 值的集合称为朱利亚集合.
- import time
- import numpy as np
- import imageio
- import PIL
- import matplotlib.pyplot as plt
- import cv2 as cv
- x1, x2, y1, y2 = -1.8, 1.8, -1.8, 1.8
- c_real, c_imag = -0.62772, -0.42193
- def calculate_z_serial_purepython(maxiter, zs, cs):
- output = [0] * len(zs)
- for i in range(len(zs)):
- n = 0
- z = zs[i]
- c = cs[i]
- while abs(z) <2 and n < maxiter:
- z = z * z + c
- n += 1
- output[i] = n
- return output
- def calc_pure_python(desired_width, max_itertions):
- x_step = (float(x2 - x1)) / float(desired_width)
- y_step = (float(y2 - y1)) / float(desired_width)
- x, y = [], []
- ycoord = y1
- while ycoord < y2:
- y.append(ycoord)
- ycoord += y_step
- xcoord = x1
- while xcoord < x2:
- x.append(xcoord)
- xcoord += x_step
- zs, cs = [], []
- for ycoord in y:
- for xcoord in x:
- zs.append(complex(xcoord, ycoord))
- cs.append(complex(c_real, c_imag))
- print(f"Length of x: {len(x)}")
- print(f"Total elements: {len(zs)}")
- start_time = time.time()
- output = calculate_z_serial_purepython(max_itertions, zs, cs)
- end_time = time.time()
- secs = end_time - start_time
- print("calculate_z_serial_purepython took", secs, "seconds")
- assert sum(output) == 33219980
- # # show img
- # output = np.array(output).reshape(desired_width, desired_width)
- # plt.imshow(output, cmap='gray')
- # plt.savefig("julia.png")
- if __name__ == "__main__":
- calc_pure_python(desired_width=1000, max_itertions=300)
这段代码运行完, 可以得到图片
运行结果
- Length of x: 1000
- Total elements: 1000000
- calculate_z_serial_purepython took 25.053941249847412 seconds
开始分析
这里, 将通过各种方法来对这段代码的运行时间来进行分析
直接打印运行时间
在前面的代码中, 我们可以看到有 start_time 和 end_time 两个变量, 通过 print 两个变量的差值即可得到运行时间, 但是, 每次想要打印运行时间都得加那么几行代码就会很麻烦, 此时我们可以通过使用修饰器来进行
- from functools import wraps
- def timefn(fn):
- @wraps(fn)
- def measure_time(*args, **kwargs):
- start_time = time.time()
- result = fn(*args, **kwargs)
- end_time = time.time()
- print("@timefn:" + fn.__name__ + "took" + str(end_time - start_time), "seconds")
- return result
- return measure_time
然后对 calculate_z_serial_purepython 函数进行测试
- @timefn
- def calculate_z_serial_purepython(maxiter, zs, cs):
- ...
运行后输出结果
- Length of x: 1000
- Total elements: 1000000
- @timefn:calculate_z_serial_purepython took 26.64286208152771 seconds
- calculate_z_serial_purepython took 26.64286208152771 seconds
另外, 也可以在命令行中输入
python -m timeit -n 5 -r 5 -s "import code" "code.calc_pure_python(desired_width=1000, max_itertions=300)"
其中 -n 5 表示循环次数, -r 5 表示重复次数, timeit 会对语句循环执行 n 次, 并计算平均值作为一个结果, 重复 r 次选出最好的结果.
5 loops, best of 5: 24.9 sec per loop
UNIX tine 命令
由于电脑上没有 Linux 环境, 于是使用 WSL 来进行
time -p python code.py
如果是 Linux 中进行, 可能命令需改成
/usr/bin/time -p python code.py
输出结果
- Length of x: 1000
- Total elements: 1000000
- @timefn:calculate_z_serial_purepython took 14.34933090209961 seconds
- calculate_z_serial_purepython took 14.350624322891235 seconds
- real 15.57
- user 15.06
- sys 0.40
其中 real 记录整体耗时, user 记录了 CPU 花在任务上的时间, sys 记录了内核函数耗费的时间
/usr/bin/time --verbose python code.py
输出, WSL 的 time 命令里面没有 --verbose 这个参数, 只能到服务器里面试了, 突然觉得我的笔记本跑的好慢...
- Length of x: 1000
- Total elements: 1000000
- @timefn:calculate_z_serial_purepython took 7.899603605270386 seconds
- calculate_z_serial_purepython took 7.899857997894287 seconds
- Command being timed: "python code.py"
- User time (seconds): 8.33
- System time (seconds): 0.08
- Percent of CPU this job got: 98%
- Elapsed (wall clock) time (h:mm:ss or m:ss): 0:08.54
- Average shared text size (kbytes): 0
- Average unshared data size (kbytes): 0
- Average stack size (kbytes): 0
- Average total size (kbytes): 0
- Maximum resident set size (kbytes): 98996
- Average resident set size (kbytes): 0
- Major (requiring I/O) page faults: 0
- Minor (reclaiming a frame) page faults: 25474
- Voluntary context switches: 0
- Involuntary context switches: 2534
- Swaps: 0
- File system inputs: 0
- File system outputs: 0
- Socket messages sent: 0
- Socket messages received: 0
- Signals delivered: 0
- Page size (bytes): 4096
- Exit status: 0
这里面需要关心的参数是 Major (requiring I/O) page faults , 表示操作系统是否由于 RAM 中的数据不存在而需要从磁盘上读取页面.
cProfile 模块
cProfile 模块是标准库内建三个的分析工具之一, 另外两个是 hotshot 和 profile.
python -m cProfile -s cumulative code.py
-s cumulative 表示对每个函数累计花费的时间进行排序
输出
- 36222017 function calls in 30.381 seconds
- Ordered by: cumulative time
- ncalls tottime percall cumtime percall filename:lineno(function)
- 1 0.000 0.000 30.381 30.381 {built-in method builtins.exec}
- 1 0.064 0.064 30.381 30.381 code.py:1(<module>)
- 1 1.365 1.365 30.317 30.317 code.py:35(calc_pure_python)
- 1 0.000 0.000 28.599 28.599 code.py:13(measure_time)
- 1 19.942 19.942 28.598 28.598 code.py:22(calculate_z_serial_purepython)
- 34219980 8.655 0.000 8.655 0.000 {built-in method builtins.abs}
- 2002000 0.339 0.000 0.339 0.000 {method 'append' of 'list' objects}
- 1 0.012 0.012 0.012 0.012 {built-in method builtins.sum}
- 4 0.003 0.001 0.003 0.001 {built-in method builtins.print}
- 1 0.000 0.000 0.000 0.000 code.py:12(timefn)
- 1 0.000 0.000 0.000 0.000 functools.py:44(update_wrapper)
- 4 0.000 0.000 0.000 0.000 {built-in method time.time}
- 1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:989(_handle_fromlist)
- 4 0.000 0.000 0.000 0.000 {built-in method builtins.len}
- 7 0.000 0.000 0.000 0.000 {built-in method builtins.getattr}
- 1 0.000 0.000 0.000 0.000 {built-in method builtins.hasattr}
- 5 0.000 0.000 0.000 0.000 {built-in method builtins.setattr}
- 1 0.000 0.000 0.000 0.000 functools.py:74(wraps)
- 1 0.000 0.000 0.000 0.000 {method 'update' of 'dict' objects}
- 1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
可以看到, 在代码的入口处总共花费了 30.381 秒, ncalls 为 1, 表示只执行了 1 次, 然后 calculate_z_serial_purepython 花费了 28.598 秒, 可以推断出调用该函数使用了近 2 秒. 另外可以看到, abs 函数被调用了 34219980 次. 对列表项的 append 操作进行了 2002000 次(1000 * 1000 * 2 +1000 * 2 ).
接下来, 我们进行更深入的分析.
python -m cProfile -o profile.stats code.py
先生成一个统计文件, 然后在 python 中进行分析
- >>> import pstats
- >>> p = pstats.Stats("profile.stats")
- >>> p.sort_stats("cumulative")
- <pstats.Stats object at 0x000002AA0A6A8908>
- >>> p.print_stats()
- Sat Apr 25 16:38:07 2020 profile.stats
- 36222017 function calls in 30.461 seconds
- Ordered by: cumulative time
- ncalls tottime percall cumtime percall filename:lineno(function)
- 1 0.000 0.000 30.461 30.461 {built-in method builtins.exec}
- 1 0.060 0.060 30.461 30.461 code.py:1(<module>)
- 1 1.509 1.509 30.400 30.400 code.py:35(calc_pure_python)
- 1 0.000 0.000 28.516 28.516 code.py:13(measure_time)
- 1 20.032 20.032 28.515 28.515 code.py:22(calculate_z_serial_purepython)
- 34219980 8.483 0.000 8.483 0.000 {built-in method builtins.abs}
- 2002000 0.360 0.000 0.360 0.000 {method 'append' of 'list' objects}
- 1 0.012 0.012 0.012 0.012 {built-in method builtins.sum}
- 4 0.004 0.001 0.004 0.001 {built-in method builtins.print}
- 1 0.000 0.000 0.000 0.000 code.py:12(timefn)
- 1 0.000 0.000 0.000 0.000 C:\Users\ITryagain\AppData\Local\conda\conda\envs\tensorflow-gpu\lib\functools.py:44(update_wrapper)
- 4 0.000 0.000 0.000 0.000 {built-in method time.time}
- 1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:989(_handle_fromlist)
- 7 0.000 0.000 0.000 0.000 {built-in method builtins.getattr}
- 1 0.000 0.000 0.000 0.000 {built-in method builtins.hasattr}
- 4 0.000 0.000 0.000 0.000 {built-in method builtins.len}
- 1 0.000 0.000 0.000 0.000 C:\Users\ITryagain\AppData\Local\conda\conda\envs\tensorflow-gpu\lib\functools.py:74(wraps)
- 1 0.000 0.000 0.000 0.000 {method 'update' of 'dict' objects}
- 5 0.000 0.000 0.000 0.000 {built-in method builtins.setattr}
- 1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
- <pstats.Stats object at 0x000002AA0A6A8908>
这里, 就生成了与上面一致的信息
- >>> p.print_callers()
- Ordered by: cumulative time
- Function was called by...
- ncalls tottime cumtime
- {built-in method builtins.exec} <-
- code.py:1(<module>) <- 1 0.060 30.461 {built-in method builtins.exec}
- code.py:35(calc_pure_python) <- 1 1.509 30.400 code.py:1(<module>)
- code.py:13(measure_time) <- 1 0.000 28.516 code.py:35(calc_pure_python)
- code.py:22(calculate_z_serial_purepython) <- 1 20.032 28.515 code.py:13(measure_time)
- {built-in method builtins.abs} <- 34219980 8.483 8.483 code.py:22(calculate_z_serial_purepython)
- {method 'append' of 'list' objects} <- 2002000 0.360 0.360 code.py:35(calc_pure_python)
- {built-in method builtins.sum} <- 1 0.012 0.012 code.py:35(calc_pure_python)
- {built-in method builtins.print} <- 1 0.000 0.000 code.py:13(measure_time)
- 3 0.003 0.003 code.py:35(calc_pure_python)
- code.py:12(timefn) <- 1 0.000 0.000 code.py:1(<module>)
- C:\Users\ITryagain\AppData\Local\conda\conda\envs\tensorflow-gpu\lib\functools.py:44(update_wrapper) <- 1 0.000 0.000 code.py:12(timefn)
- {built-in method time.time} <- 2 0.000 0.000 code.py:13(measure_time)
- 2 0.000 0.000 code.py:35(calc_pure_python)
- <frozen importlib._bootstrap>:989(_handle_fromlist) <- 1 0.000 0.000 code.py:1(<module>)
- {built-in method builtins.getattr} <- 7 0.000 0.000 C:\Users\ITryagain\AppData\Local\conda\conda\envs\tensorflow-gpu\lib\functools.py:44(update_wrapper)
- {built-in method builtins.hasattr} <- 1 0.000 0.000 <frozen importlib._bootstrap>:989(_handle_fromlist)
- {built-in method builtins.len} <- 2 0.000 0.000 code.py:22(calculate_z_serial_purepython)
- 2 0.000 0.000 code.py:35(calc_pure_python)
- C:\Users\ITryagain\AppData\Local\conda\conda\envs\tensorflow-gpu\lib\functools.py:74(wraps) <- 1 0.000 0.000 code.py:12(timefn)
- {method 'update' of 'dict' objects} <- 1 0.000 0.000 C:\Users\ITryagain\AppData\Local\conda\conda\envs\tensorflow-gpu\lib\functools.py:44(update_wrapper)
- {built-in method builtins.setattr} <- 5 0.000 0.000 C:\Users\ITryagain\AppData\Local\conda\conda\envs\tensorflow-gpu\lib\functools.py:44(update_wrapper)
- {method 'disable' of '_lsprof.Profiler' objects} <-
- <pstats.Stats object at 0x000002AA0A6A8908>
这里, 我们可以看到, 在每一行最后会有调用这部分的父函数名称, 这样我们就可以定位到对某一操作最费时的那个函数.
我们还可以显示那个函数调用了其它函数
- >>> p.print_callees()
- Ordered by: cumulative time
- Function called...
- ncalls tottime cumtime
- {built-in method builtins.exec} -> 1 0.060 30.461 code.py:1(<module>)
- code.py:1(<module>) -> 1 0.000 0.000 <frozen importlib._bootstrap>:989(_handle_fromlist)
- 1 0.000 0.000 code.py:12(timefn)
- 1 1.509 30.400 code.py:35(calc_pure_python)
- code.py:35(calc_pure_python) -> 1 0.000 28.516 code.py:13(measure_time)
- 2 0.000 0.000 {built-in method builtins.len}
- 3 0.003 0.003 {built-in method builtins.print}
- 1 0.012 0.012 {built-in method builtins.sum}
- 2 0.000 0.000 {built-in method time.time}
- 2002000 0.360 0.360 {method 'append' of 'list' objects}
- code.py:13(measure_time) -> 1 20.032 28.515 code.py:22(calculate_z_serial_purepython)
- 1 0.000 0.000 {built-in method builtins.print}
- 2 0.000 0.000 {built-in method time.time}
- code.py:22(calculate_z_serial_purepython) -> 34219980 8.483 8.483 {built-in method builtins.abs}
- 2 0.000 0.000 {built-in method builtins.len}
- {built-in method builtins.abs} ->
- {method 'append' of 'list' objects} ->
- {built-in method builtins.sum} ->
- {built-in method builtins.print} ->
- code.py:12(timefn) -> 1 0.000 0.000 C:\Users\ITryagain\AppData\Local\conda\conda\envs\tensorflow-gpu\lib\functools.py:44(update_wrapper)
- 1 0.000 0.000 C:\Users\ITryagain\AppData\Local\conda\conda\envs\tensorflow-gpu\lib\functools.py:74(wraps)
- C:\Users\ITryagain\AppData\Local\conda\conda\envs\tensorflow-gpu\lib\functools.py:44(update_wrapper) -> 7 0.000 0.000 {built-in method builtins.getattr}
- 5 0.000 0.000 {built-in method builtins.setattr}
- 1 0.000 0.000 {method 'update' of 'dict' objects}
- {built-in method time.time} ->
- <frozen importlib._bootstrap>:989(_handle_fromlist) -> 1 0.000 0.000 {built-in method builtins.hasattr}
- {built-in method builtins.getattr} ->
- {built-in method builtins.hasattr} ->
- {built-in method builtins.len} ->
- C:\Users\ITryagain\AppData\Local\conda\conda\envs\tensorflow-gpu\lib\functools.py:74(wraps) ->
- {method 'update' of 'dict' objects} ->
- {built-in method builtins.setattr} ->
- {method 'disable' of '_lsprof.Profiler' objects} ->
- <pstats.Stats object at 0x000002AA0A6A8908>
line_profiler 逐行分析
前面我们通过 cProfile 来对代码进行了整体的分析, 当我们确定了耗时多的函数后, 想对该函数进行进一步分析时, 就可以使用 line_profiler 了.
先安装
pip install line_profiler
或
conda install line_profiler
在需要测试的函数前面加上修饰器 @profile, 然后命令函输入
kernprof -l -v code.py
输出
- Wrote profile results to code.py.lprof
- Timer unit: 1e-07 s
- Total time: 137.019 s
File: code.py
- Function: calculate_z_serial_purepython at line 23
- Line # Hits Time Per Hit % Time Line Contents
- ==============================================================
- 23 @profile
- 24 def calculate_z_serial_purepython(maxiter, zs, cs):
- 25 1 89776.0 89776.0 0.0 output = [0] * len(zs)
- 26 1000001 9990393.0 10.0 0.7 for i in range(len(zs)):
- 27 1000000 9244029.0 9.2 0.7 n = 0
- 28 1000000 10851654.0 10.9 0.8 z = zs[i]
- 29 1000000 10242762.0 10.2 0.7 c = cs[i]
- 30 34219980 558122806.0 16.3 40.7 while abs(z) < 2 and n < maxiter:
- 31 33219980 403539388.0 12.1 29.5 z = z * z + c
- 32 33219980 356918574.0 10.7 26.0 n += 1
- 33 1000000 11186107.0 11.2 0.8 output[i] = n
- 34 1 12.0 12.0 0.0 return output
运行时间比较长.. 不过, 这里可以发现, 耗时的操作主要都在 while 循环中, 做判断的耗时最长, 但是这里我们并不知道是 abs(z) < 2 还是 n < maxiter 更花时间. z 与 n 的更新也比较花时间, 这是因为在每次循环时, Python 的动态查询机制都在工作.
那么, 这里可以通过 timeit 来进行测试
- In [1]: z = 0 + 0j
- In [2]: %timeit abs(z) < 2
- 357 ns ± 21.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
- In [3]: n = 1
- In [4]: maxiter = 300
- In [5]: %timeit n < maxiter
- 119 ns ± 6.91 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
可以看到, n < maxiter 所需时间更短, 并且每 301 次会有一次 False, 而 abs(z) < 2 为 False 的次数我们并不好估计, 占比约为前面图片中白色部分所占比例. 因此, 我们可以假设交换两条语句的顺序可以使得程序运行速度更快.
Total time: 132.816 s
File: code.py
- Function: calculate_z_serial_purepython at line 23
- Line # Hits Time Per Hit % Time Line Contents
- ==============================================================
- 23 @profile
- 24 def calculate_z_serial_purepython(maxiter, zs, cs):
- 25 1 83002.0 83002.0 0.0 output = [0] * len(zs)
- 26 1000001 9833163.0 9.8 0.7 for i in range(len(zs)):
- 27 1000000 9241272.0 9.2 0.7 n = 0
- 28 1000000 10667576.0 10.7 0.8 z = zs[i]
- 29 1000000 10091308.0 10.1 0.8 c = cs[i]
- 30 34219980 531157092.0 15.5 40.0 while n < maxiter and abs(z) < 2:
- 31 33219980 393275303.0 11.8 29.6 z = z * z + c
- 32 33219980 352964180.0 10.6 26.6 n += 1
- 33 1000000 10851379.0 10.9 0.8 output[i] = n
- 34 1 11.0 11.0 0.0 return output
可以看到, 确实是有所优化.
小节
从开始学习编程到现在差不多快 3 年了, 之前可以说是从来没有利用这些工具来对代码性能进行过分析, 最多也只是通过算法复杂度的分析来进行优化, 接触了这些之后就感觉, 需要学习的东西还有很多. 在近期进行的华为软挑中, 队友也曾对代码 (C++) 的运行时间进行过分析, 如下图.
下篇将介绍对运行时内存的分析.
参考
《Python 高性能编程》
来源: https://www.cnblogs.com/csu-lmw/p/12774328.html