当前位置：

首页
/
IT
/
程序
/
Python
/
Python: 怎样用线程将任务并行化?

Python: 怎样用线程将任务并行化?

如果待处理任务满足:

可拆分, 即任务可以被拆分为多个子任务, 或任务是多个相同的任务的集合;

任务不是 CPU 密集型的, 如任务涉及到较多 IO 操作(如文件读取和网络数据处理)

则使用多线程将任务并行运行, 能够提高运行效率.

假设待处理的任务为: 有很多文件目录, 对于每个文件目录, 搜索匹配一个给定字符串的文件的所有行(相当于是实现 grep 的功能). 则此处子任务为: 给定一个目录, 搜索匹配一个给定字符串的文件的所有行. 总的任务为处理所有目录.

将子任务表示为一个函数 T, 如下所示:

def T(dir, pattern):
  print('searching pattern %s in dir %s' % (pattern, dir))
  ...

为每个子任务创建一个线程

要实现并行化, 最简单的方法是为每一个子任务创建一个 thread,thread 处理完后退出.

from threading import Thread
from time import sleep
def T(dir, pattern):
  "This is just a stub that simulate a dir operation"
  sleep(1)
  print('searching pattern %s in dir %s' % (pattern, dir))
threads = []
dirs = ['a/b/c', 'a/b/d', 'b/c', 'd/f']
pattern = 'hello'
for dir in dirs:
  thread = Thread(target=T, args=(dir, pattern))   1
  thread.start()   2
  threads.append(thread)
for thread in threads:
  thread.join()   3
print('Main thread end here')

1 : 创建一个 Thread 对象, target 参数指定这个 thread 待执行的函数, args 参数指定 target 函数的输入参数

2 : 启动这个 thread. T(dir, pattern)将被调用

3 : 等待, 直到这个 thread 结束. 整个 for 循环表示主进程会等待所有子线程结束后再退出

程序的运行结果为:

searching pattern hello in dir a/b/csearching pattern hello in dir d/f

searching pattern hello in dir b/c

searching pattern hello in dir a/b/d
Main thread end here

可以看出由于线程是并行运行的, 部分输出会交叠. 但主进程的打印总在最后.

以上例子中对于每个 dir 都需要创建一个 thread. 如果 dir 的数目较多, 则会创建太多的 thread, 影响运行效率. 较好的方式是限制总线程的数目.

限制线程数目

可以使用信号量 (semaphore) 来限制同时运行的最大线程数目. 如下所示:

from threading import Thread, BoundedSemaphore
from time import sleep
def T(dir, pattern):
  "This is just a stub that simulate a dir operation"
  sleep(1)
  print('searching pattern %s in dir %s' % (pattern, dir))
threads = []
dirs = ['a/b/c', 'a/b/d', 'b/c', 'd/f']
pattern = 'hello'
maxjobs = BoundedSemaphore(2)   1
def wrapper(dir, pattern):
  T(dir, pattern)
  maxjobs.release()   2
for dir in dirs:
  maxjobs.acquire()   3
  thread = Thread(target=wrapper, args=(dir, pattern))
  thread.start()
  threads.append(thread)
for thread in threads:
  thread.join()
print('Main thread end here')

1 : 创建一个有 2 个资源的信号量. 一个信号量代表总的可用的资源数目, 这里表示同时运行的最大线程数目为 2.

2 : 在线程结束时释放资源. 运行在子线程中.

3 : 在启动一个线程前, 先获取一个资源. 如果当前已经有 2 个线程在运行, 则会阻塞, 直到其中一个线程结束. 运行在主线程中.

当限制了最大运行线程数为 2 后, 由于只有 2 个线程同时运行, 程序的输出更加有序, 几乎总是为:

searching pattern hello in dir a/b/c
searching pattern hello in dir a/b/d

searching pattern hello in dir b/c

searching pattern hello in dir d/f

Main thread end here

以上实现中为每个子任务创建一个线程进行处理, 然后通过信号量限制同时运行的线程的数目. 如果子任务很多, 这种方法会创建太多的线程. 更好的方法是使用线程池.

使用线程池(THREAD POOL)

即预先创建一定数目的线程, 形成一个线程池. 每个线程持续处理多个子任务(而不是处理一个就退出). 这样做的好处是: 创建的线程数目会比较固定.

那么, 每个线程处理哪些子任务呢? 一种方法为: 预先将所有子任务均分给每个线程. 如下所示:

from threading import Thread
from time import sleep
def T(dir, pattern):
  "This is just a stub that simulate a dir operation"
  sleep(1)
  print('searching pattern %s in dir %s' % (pattern, dir))
dirs = ['a/b/c', 'a/b/d', 'b/c', 'd/f']
pattern = 'hello'
def wrapper(dirs, pattern):   1
  for dir in dirs:
    T(dir, pattern)
threadsPool = [   2
  Thread(target=wrapper, args=(dirs[0:2], pattern)),
  Thread(target=wrapper, args=(dirs[2:], pattern)),
]
for thread in threadsPool:   3
  thread.start()
for thread in threadsPool:
  thread.join()
print('Main thread end here')

1 : 这个函数能够处理多个 dir, 将作为线程的 target 函数

2 : 创建一个有 2 个线程的线程池. 并事先分配子任务给每个线程. 线程 1 处理前两个 dir, 线程 2 处理后两个 dir

3 : 启动线程池中所有线程

程序的输出结果为:

searching pattern hello in dir a/b/csearching pattern hello in dir b/c

searching pattern hello in dir d/f

searching pattern hello in dir a/b/d
Main thread end here

这种方法存在以下问题:

子任务分配可能不均. 导致每个线程运行时间差别可能较大, 则整体运行时长可能被拖长

只能处理所有子任务都预先知道的情况, 无法处理子任务实时出现的情况

如果有一种方法, 能够让线程知道当前所有的待处理子任务, 线程一旦空闲, 便可以从中获取一个任务进行处理, 则以上问题都可以解决. 任务队列便是解决方案.

使用消息队列

可以使用 Queue 实现一个任务队列, 用于在线程间传递子任务. 主线程将所有待处理子任务放置在队列中, 子线程从队列中获取子任务去处理. 如下所有(注: 以下代码只运行于 Python 2, 因为 Queue 只存在于 Python 2) :

from threading import Thread
from time import sleep
import Queue
def T(dir, pattern):
  "This is just a stub that simulate a dir operation"
  sleep(1)
  print('searching pattern %s in dir %s' % (pattern, dir))
dirs = ['a/b/c', 'a/b/d', 'b/c', 'd/f']
pattern = 'hello'
taskQueue = Queue.Queue()   1
def wrapper():
  while True:
    try:
      dir = taskQueue.get(True, 0.1)   2
      T(dir, pattern)
    except Queue.Empty:
	continue
threadsPool = [Thread(target=wrapper) for i in range(2)]   3
for thread in threadsPool:
  thread.start()    4
for dir in dirs:
  taskQueue.put(dir)   5
for thread in threadsPool:
  thread.join()
print('Main thread end here')

1 : 创建一个任务队列

2 : 子线程从任务队列中获取一个任务. 第一个参数为 True, 表示如果没有任务, 会等待. 第二个参数表示最长等待 0.1 秒如果在 0.1 秒后仍然没有任务, 则会抛出一个 Queue.Empty 的异常

3 : 创建有 2 个线程的线程池. 注意 target 函数 wrapper 没有任何参数

4 : 启动所有线程

5 : 主线程将所有子任务放置在任务队列中, 以供子线程获取处理. 由于子线程已经被启动, 则子线程会立即获取到任务并处理

程序的输出为:

searching pattern hello in dir a/b/c
searching pattern hello in dir a/b/d

searching pattern hello in dir b/c

searching pattern hello in dir d/f

从中可以看出主进程的打印结果并没有出来, 程序会一直运行, 而不退出. 这个问题的原因是: 目前的实现中, 子线程为一个无限循环, 因此其永远不会终止. 因此, 必须有一种机制来结束子进程.

终止子进程

一种简单方法为, 可以在任务队列中放置一个特殊元素, 作为终止符. 当子线程从任务队列中获取这个终止符后, 便自行退出. 如下所示, 使用 None 作为终止符.

from threading import Thread
from time import sleep
import Queue
def T(dir, pattern):
  "This is just a stub that simulate a dir operation"
  sleep(1)
  print('searching pattern %s in dir %s' % (pattern, dir))
dirs = ['a/b/c', 'a/b/d', 'b/c', 'd/f']
pattern = 'hello'
taskQueue = Queue.Queue()
def wrapper():
  while True:
    try:
      dir = taskQueue.get(True, 0.1)
      if dir is None:   1
	taskQueue.put(dir)   2
	break
      T(dir, pattern)
    except Queue.Empty:
	continue
threadsPool = [Thread(target=wrapper) for i in range(2)]
for thread in threadsPool:
  thread.start()
for dir in dirs:
  taskQueue.put(dir)
taskQueue.put(None)   3
for thread in threadsPool:
  thread.join()
print('Main thread end here')

1 : 如果任务为终止符(此处为 None), 则退出

2 : 将这个终止符重新放回任务队列. 因为只有一个终止符, 如果不放回, 则其它子线程获取不到, 也就无法终止

3 : 将终止符放在任务队列. 注意必须放置在末尾, 否则终止符后的任务无法得到处理

修改过后, 程序能够正常运行, 主进程能够正常退出了.

searching pattern hello in dir a/b/csearching pattern hello in dir a/b/d

searching pattern hello in dir b/c

searching pattern hello in dir d/f

Main thread end here

总结

要并行化处理子任务, 最简单的方法是为每个子任务创建一个线程去处理. 这种方法的缺点是: 如果子任务非常多, 则需要创建的线程数目会非常多. 并且同时运行的线程数目也会较多. 通过使用信号量来限制同时运行的线程数目, 通过线程池来避免创建过多的线程.

与每个线程处理一个任务不同, 线程池中每个线程会处理多个子任务. 这带来一个问题: 每个子线程如何知道要处理哪些子任务. 一种方法是预先将所有子任务均分给每个线程, 而更灵活的方法则是通过任务队列, 由子线程自行决定要处理哪些任务.

使用线程池时, 线程主函数通常实现为一个无限循环, 因此需要考虑如何终止线程. 可以在任务队列中放置一个终止符来告诉线程没有更多任务, 因此其可以终止.

来源: https://www.cnblogs.com/astropeak/p/9035300.html

与本文相关文章

暂无,快来抢沙发吧！