Python Crawler Ethics: the robots Protocol

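Before writing a crawler to collect data, check the site's robots.txt file so that you can skip the pages the site does not want crawled. This helps avoid the legal problems that copyrighted data can cause later on.

The robots protocol tells crawlers, such as search engine spiders, which pages they may fetch and which they may not. It is only a widely accepted convention, not an enforced rule, and compliance is entirely voluntary. Conscientious engineers follow it anyway, because doing so helps build a healthier internet.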
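
To make the allow/disallow idea concrete, here is a minimal offline sketch using Python 3's standard `urllib.robotparser`. The robots.txt content, the `BadCrawler` agent name, and the `example.com` URLs are made up for illustration; only the `wswp` user agent comes from this article.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
SAMPLE_ROBOTS = """\
User-agent: BadCrawler
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_text):
    """Parse robots.txt text that is already in memory (no network access)."""
    rp = RobotFileParser()
    rp.parse(robots_text.splitlines())
    return rp


rp = build_parser(SAMPLE_ROBOTS)
print(rp.can_fetch("wswp", "http://example.com/index.html"))         # True
print(rp.can_fetch("wswp", "http://example.com/private/data.html"))  # False
print(rp.can_fetch("BadCrawler", "http://example.com/index.html"))   # False
```

The same file can give different answers to different user agents, which is why a crawler should check with the name it actually sends.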

A site's robots file is usually found by appending robots.txt to the address of its home page, for example www.taobao.com/robots.txt
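
As a small sketch of that rule, the robots.txt address can be derived from any absolute page URL with `urllib.parse`. The helper name `robots_url` and the page path in the example are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host; replace path, query, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# The page path below is a made-up example.
print(robots_url("https://www.taobao.com/some/page.html?id=1"))
# https://www.taobao.com/robots.txt
```

This assumes the URL includes a scheme such as `http://` or `https://`; without one, `urlsplit` cannot identify the host.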

Below is a small program that checks whether a user agent is allowed by the robots file and downloads the page only if it is:

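# Python 3 module names (Python 2 called these robotparser and urllib2)
import urllib.error
import urllib.request
import urllib.robotparser
from urllib.parse import urlsplit


def download(url, user_agent='wswp', num_retries=2):
    """Download url and return its body as bytes, or None on failure."""
    print('Downloading:', url)
    headers = {'User-agent': user_agent}
    request = urllib.request.Request(url, headers=headers)
    try:
        html = urllib.request.urlopen(request).read()
    except urllib.error.URLError as e:
        print('Download error:', e.reason)
        html = None
        # Retry only on 5xx server errors, keeping the same user agent
        if num_retries > 0 and hasattr(e, 'code') and 500 <= e.code < 600:
            return download(url, user_agent, num_retries - 1)
    return html


def can_be_download(url, user_agent='wswp'):  # set a default user agent
    """Download url only if robots.txt allows user_agent to fetch it."""
    rp = urllib.robotparser.RobotFileParser()
    parts = urlsplit(url)  # url must be absolute, e.g. http://host/path
    rp.set_url(parts.scheme + '://' + parts.netloc + '/robots.txt')  # robots.txt address
    rp.read()
    if rp.can_fetch(user_agent, url):
        return download(url, user_agent)  # pass the full URL, not just the host
    return None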
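
Beyond allow and disallow rules, some sites also publish a `Crawl-delay` directive asking crawlers to pause between requests. This is not covered in the article above and is not honored by every crawler, so treat the following as a sketch under assumptions: it relies on `RobotFileParser.crawl_delay` (available in Python 3.6 and later), and the robots.txt content, the helper name `seconds_between_requests`, and the one-second fallback are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
SAMPLE_ROBOTS = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""


def seconds_between_requests(rp, user_agent, default=1):
    """Return the site's Crawl-delay for user_agent, or default if none is set."""
    delay = rp.crawl_delay(user_agent)
    return default if delay is None else delay


rp = RobotFileParser()
rp.parse(SAMPLE_ROBOTS.splitlines())
wait = seconds_between_requests(rp, "wswp")
print(wait)  # 5
```

A crawler could call `time.sleep(wait)` between downloads from the same site to respect the requested pace.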


Source: http://www.bubuko.com/infodetail-2288328.html
