Analyzing Async Crawler Efficiency with the PyCharm Profiler

The first version is shown below. It is an ordinary crawler built around a for loop.

import requests
import bs4
from colorama import Fore


def main():
    get_title_range()
    print("Done.")


def get_html(episode_number: int) -> str:
    print(Fore.YELLOW + f"Getting HTML for episode {episode_number}", flush=True)

    url = f'https://talkpython.fm/{episode_number}'
    resp = requests.get(url)
    resp.raise_for_status()

    return resp.text


def get_title(HTML: str, episode_number: int) -> str:
    print(Fore.CYAN + f"Getting TITLE for episode {episode_number}", flush=True)
    soup = bs4.BeautifulSoup(HTML, 'html.parser')
    header = soup.select_one('h1')
    if not header:
        return "MISSING"

    return header.text.strip()


def get_title_range():
    # Please keep this range pretty small to not DDoS my site. ;)
    for n in range(185, 200):
        HTML = get_html(n)
        title = get_title(HTML, n)
        print(Fore.WHITE + f"Title found: {title}", flush=True)


if __name__ == '__main__':
    main()

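This code took 37s to run. Next, let's use PyCharm's profiler to see exactly where the time goes.

Click Profile (file name).

This produces a detailed view of the function call relationships and how long each call took.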

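As you can see, get_html accounts for 96.7% of the run time. In other words, about 97% of this program's time is spent on I/O: while it fetches the HTML, the program just sits there waiting. If, instead of idly waiting for each I/O operation to finish, it did other useful work in the meantime, we could save a lot of time.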
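
The same kind of measurement can be reproduced outside PyCharm with the standard library's cProfile and pstats modules. The sketch below is only an illustration, not the article's crawler: fake_get_html and fake_get_title are hypothetical stand-ins, and time.sleep simulates the network wait so that nothing is actually requested. It assumes Python 3.9 or later for get_stats_profile.

```python
import cProfile
import pstats
import time


def fake_get_html(n: int) -> str:
    # Hypothetical stand-in for the network call: sleeping simulates I/O wait.
    time.sleep(0.05)
    return f"<h1>Episode {n}</h1>"


def fake_get_title(html: str) -> str:
    # Cheap CPU-only work standing in for the HTML parsing step.
    return html.replace("<h1>", "").replace("</h1>", "")


def crawl() -> list:
    # Sequential loop, same shape as the for-loop crawler.
    return [fake_get_title(fake_get_html(n)) for n in range(5)]


profiler = cProfile.Profile()
profiler.enable()
titles = crawl()
profiler.disable()

stats = pstats.Stats(profiler)
stats.sort_stats("cumulative").print_stats(5)

# Share of the total profiled time spent inside fake_get_html (including the sleep).
profile = stats.get_stats_profile()
html_share = profile.func_profiles["fake_get_html"].cumtime / profile.total_tt
print(f"fake_get_html share of total time: {html_share:.0%}")
```

As in the PyCharm view, almost all of the time should land in the function that waits on I/O, while the parsing step barely registers.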

Let's do a quick calculation: how much time could we save by fetching asynchronously with asyncio?

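get_html took 36.8s in total over 15 calls, so fetching the HTML for a single link takes about 36.8s / 15 ≈ 2.45s. If the requests were fully asynchronous, fetching all 15 links would still take about 2.45s. Adding the 0.6s spent in get_title, we estimate that the improved program should finish in about 3s, which is roughly a 12x speedup.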
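
The estimate above can be written out as a few lines of arithmetic. The numbers are the ones reported in the text; the variable names are only illustrative, and the model assumes ideal overlap where all requests wait at the same time.

```python
sync_total = 37.0       # wall time of the for-loop version, in seconds
get_html_total = 36.8   # profiler time attributed to get_html
get_title_total = 0.6   # profiler time attributed to get_title
calls = 15              # number of get_html calls

# Average time for one request.
per_request = get_html_total / calls

# With ideal overlap, all requests together cost about one request,
# and the parsing time is added on top.
async_estimate = per_request + get_title_total

speedup = sync_total / async_estimate

print(f"per request:    {per_request:.2f}s")
print(f"async estimate: {async_estimate:.2f}s")
print(f"speedup:        {speedup:.1f}x")
```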

Now let's look at the improved code.

import asyncio

import aiohttp
import bs4
from colorama import Fore


def main():
    # asyncio.run creates the event loop, runs the coroutine, and closes the loop
    asyncio.run(get_title_range())
    print("Done.")


async def get_html(episode_number: int) -> str:
    print(Fore.YELLOW + f"Getting HTML for episode {episode_number}", flush=True)

    # Make this async with aiohttp's ClientSession
    url = f'https://talkpython.fm/{episode_number}'

    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            resp.raise_for_status()

            HTML = await resp.text()
            return HTML


def get_title(HTML: str, episode_number: int) -> str:
    print(Fore.CYAN + f"Getting TITLE for episode {episode_number}", flush=True)
    soup = bs4.BeautifulSoup(HTML, 'html.parser')
    header = soup.select_one('h1')
    if not header:
        return "MISSING"

    return header.text.strip()


async def get_title_range():
    # Please keep this range pretty small to not DDoS my site. ;)

    # Start every request first so that they all run concurrently
    tasks = []
    for n in range(190, 200):
        tasks.append((asyncio.create_task(get_html(n)), n))

    # Then await the results in order and parse each page
    for task, n in tasks:
        HTML = await task
        title = get_title(HTML, n)
        print(Fore.WHITE + f"Title found: {title}", flush=True)


if __name__ == '__main__':
    main()
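
The important detail in get_title_range is that every task is created before the first await. A minimal sketch of why that matters follows; fake_get_html, one_at_a_time, tasks_first, and timed are hypothetical names, and asyncio.sleep stands in for the aiohttp request so nothing touches the network.

```python
import asyncio
import time


async def fake_get_html(n: int) -> str:
    # Hypothetical stand-in for the aiohttp request: waits without blocking the loop.
    await asyncio.sleep(0.1)
    return f"<h1>Episode {n}</h1>"


async def one_at_a_time(numbers) -> list:
    # Awaiting each coroutine immediately runs the requests back to back.
    return [await fake_get_html(n) for n in numbers]


async def tasks_first(numbers) -> list:
    # Creating all tasks up front schedules every request before the first await.
    tasks = [asyncio.create_task(fake_get_html(n)) for n in numbers]
    return [await task for task in tasks]


def timed(coro):
    # Run a coroutine to completion and return its result and elapsed seconds.
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


pages_seq, seq_seconds = timed(one_at_a_time(range(5)))
pages_con, con_seconds = timed(tasks_first(range(5)))

print(f"one at a time: {seq_seconds:.2f}s")
print(f"tasks first:   {con_seconds:.2f}s")
```

Both variants return the same pages in the same order. Only the second one overlaps the waits, so its run time should stay close to that of a single request.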

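Profiling this version the same way shows that the run now takes about 3.8s, which is roughly in line with our estimate. Note that this version requests 10 episodes (190 to 199) rather than the 15 fetched by the first version, so the two timings are not an exact like-for-like comparison.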

Source: http://www.jianshu.com/p/34c64ee865f3
