Python web scraping: the urllib library

The urllib request library

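urllib is split into several modules:

urllib.request sends requests

urllib.error handles exceptions raised while making a request

urllib.parse parses and builds URLs

urllib.robotparser parses robots.txt --> the file in which a site states what crawlers are allowed to fetch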
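As a rough illustration of the last module, the sketch below feeds a made-up robots.txt to urllib.robotparser and asks whether two URLs may be fetched. The rules, the crawler name "my-crawler", and the example.com URLs are all assumptions for illustration; the text is inlined so the snippet runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so no request is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse() takes an iterable of lines

# "my-crawler" is an invented user agent name.
allowed_public = parser.can_fetch("my-crawler", "http://example.com/index.html")
allowed_private = parser.can_fetch("my-crawler", "http://example.com/private/data.html")
print(allowed_public, allowed_private)
```

Against a real site you would normally call set_url() with the site's robots.txt address and then read() instead of parse().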

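urllib.request methods

data = urllib.request.urlopen(url)  # returns a response object

data.read() ---> returns the page source (bytes; call decode() to turn it into a str, UTF-8 by default). Note that the page source does not include data generated later by JavaScript.

data.info() ---> returns the response headers

data.getcode() ---> returns the status code

data.geturl() ---> returns the URL that was requested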
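The calls above can be tried without touching the network by opening a data: URL, which urlopen also accepts. This is only a sketch: the URL is an invented placeholder, and with a real http(s) URL the same calls return the actual page, headers, and status code. On Python 3.9+ the attributes status, url, and headers are the preferred spellings of getcode(), geturl(), and info().

```python
import urllib.request

# A data: URL keeps the sketch offline; swap in a real http(s) URL in practice.
url = "data:text/plain;charset=utf-8,hello%20urllib"

with urllib.request.urlopen(url) as response:
    body = response.read()                             # raw bytes
    text = body.decode("utf-8")                        # bytes -> str
    content_type = response.info().get_content_type()  # from the response headers
    final_url = response.geturl()                      # URL that was opened

print(text, content_type, final_url)
```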

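When a script sends a request, the User-Agent header is Python-urllib/3.6 (the suffix follows the Python version). Some sites use this value to detect scripted requests and filter out crawlers. So how do we imitate a browser?

Use urllib.request.Request() to attach header information to the request.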

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
}
request = urllib.request.Request(url, headers=headers)  # create a Request object; headers can be passed at construction
data = urllib.request.urlopen(request)  # urlopen accepts a Request object as well as a URL string
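To check what a Request will carry before sending it, the header can be read back from the object. A minimal sketch, assuming a placeholder URL and a shortened User-Agent string chosen only for illustration:

```python
import urllib.request

url = "http://example.com/"  # placeholder URL
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}  # shortened, illustrative value

request = urllib.request.Request(url, headers=headers)

# Request stores header names via str.capitalize(), so the key becomes "User-agent".
stored_agent = request.get_header("User-agent")
print(stored_agent)
```

No request is sent here; the object is only constructed and inspected.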

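How does urllib.request decide whether a request is a GET or a POST?

The urllib.request source code shows that if the request carries a data argument it is sent as a POST; otherwise it is sent as a GET (unless a method is passed explicitly).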
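This rule can be observed directly with Request.get_method(), without sending anything. The URL and form fields below are placeholders chosen for illustration:

```python
import urllib.parse
import urllib.request

url = "http://example.com/login"  # placeholder URL
# data must be bytes, so the encoded query string is converted explicitly.
form = urllib.parse.urlencode({"user": "walker", "age": 99}).encode("ascii")

get_request = urllib.request.Request(url)                            # no data
post_request = urllib.request.Request(url, data=form)                # data present
put_request = urllib.request.Request(url, data=form, method="PUT")   # explicit override

print(get_request.get_method(), post_request.get_method(), put_request.get_method())
```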

Using cookies

Boilerplate

import http.cookiejar
import urllib.request
# Create a CookieJar object
cookie_jar = http.cookiejar.CookieJar()
# Wrap it in an HTTPCookieProcessor handler and build an opener from that handler
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))
# Install the opener so that urlopen uses it; after this, urlopen stores cookies too
urllib.request.install_opener(opener)  # installing is optional; you can call opener.open(url) directly instead

Setting a proxy

Boilerplate

proxy = {'http': '183.232.188.18:80', 'https': '183.232.188.18:80'}  # proxy ip:port
# Create the proxy handler
proxies = urllib.request.ProxyHandler(proxy)
# Create the opener object
opener = urllib.request.build_opener(proxies, urllib.request.HTTPHandler)
urllib.request.install_opener(opener)  # installing is optional; with opener.open(url) only the opener uses the proxy and urlopen does not
urllib.error

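URLError is the parent class. It is raised when the network is down or the server does not exist. It has a reason attribute but no code attribute.

HTTPError is the subclass. It is raised when the server exists but the address does not. It has both code and reason attributes.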

# url = 'http://www.adfdsfdsdsa.com'  # URLError
# url = 'https://jianshu.com/p/jkhsdhgjkasdhgjkadfhg'  # HTTPError

How to tell them apart

try:
    data = urllib.request.urlopen(url)
    print(data.read().decode())
except urllib.error.URLError as e:
    if hasattr(e,'code'):
        print('HTTPError')
    elif hasattr(e,'reason'):
        print('URLError')
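An alternative to the hasattr checks is to catch the subclass first and the parent second. The sketch below is one way to write that; classify, raise_, and the example errors are hypothetical helpers built by hand so the snippet runs offline, not part of urllib.

```python
import io
import urllib.error

def classify(action):
    """Run action() and report which urllib error, if any, it raised."""
    try:
        action()
    except urllib.error.HTTPError as e:  # subclass must come first
        return "HTTPError {} {}".format(e.code, e.reason)
    except urllib.error.URLError as e:   # parent class: only reason is available
        return "URLError {}".format(e.reason)
    return "ok"

def raise_(exc):
    """Tiny helper so an exception can be raised from a lambda."""
    raise exc

# Hand-built errors standing in for what urlopen would raise.
http_error = urllib.error.HTTPError(
    "http://example.com/missing", 404, "Not Found", {}, io.BytesIO())
url_error = urllib.error.URLError("name resolution failed")

http_result = classify(lambda: raise_(http_error))
url_result = classify(lambda: raise_(url_error))
ok_result = classify(lambda: None)
print(http_result, url_result, ok_result, sep="\n")
```

In real use, action would be something like lambda: urllib.request.urlopen(url). If the two except clauses were swapped, the URLError branch would also catch every HTTPError.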
urllib.parse
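import urllib.parse
# urllib.parse.urljoin()    # joins a base URL with another URL; if the second one is itself an absolute URL, it replaces the base
# urllib.parse.urlencode()  # converts a dict into a query string
# urllib.parse.quote()      # URLs must be ASCII, so non-ASCII text such as Chinese has to be percent-encoded
# urllib.parse.unquote()    # decodes a percent-encoded string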
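A short sketch of the urljoin behavior described above, using made-up example.com and example.org URLs:

```python
from urllib.parse import urljoin

base = "http://example.com/docs/index.html"  # placeholder base URL

relative = urljoin(base, "page2.html")                  # resolved against the base directory
rooted = urljoin(base, "/about")                        # resolved against the site root
absolute = urljoin(base, "http://other.example.org/x")  # an absolute URL replaces the base

print(relative, rooted, absolute, sep="\n")
```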
urlencode
>>> from urllib import parse
>>> query = {
...     'name': 'walker',
...     'age': 99,
... }
>>> parse.urlencode(query)
'name=walker&age=99'
quote/quote_plus
>>> from urllib import parse
>>> parse.quote('a&b/c')  # the slash is left unencoded
'a%26b/c'
>>> parse.quote_plus('a&b/c')  # the slash is encoded as well
'a%26b%2Fc'
unquote/unquote_plus
>>> from urllib import parse
>>> parse.unquote('1+2')  # the plus sign is not decoded
'1+2'
>>> parse.unquote_plus('1+2')  # the plus sign is decoded to a space
'1 2'
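url = 'http://www.baidu.com/?wd=书包'
urllib.request.urlopen(url)  # raises an encoding error because the URL contains non-ASCII characters
# To make the request succeed, percent-encode the Chinese text first, then send the request:
# url = 'http://www.baidu.com/?wd={}'.format(urllib.parse.quote('书包'))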
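The encoding step can be checked on its own, with no request sent. In this sketch, quote encodes a single value and urlencode builds the whole query string; both percent-encode the UTF-8 bytes of the keyword:

```python
from urllib import parse

keyword = "书包"

encoded = parse.quote(keyword)                       # percent-encode the UTF-8 bytes
url = "http://www.baidu.com/?wd={}".format(encoded)  # now pure ASCII
query = parse.urlencode({"wd": keyword})             # encodes keys and values in one step
decoded = parse.unquote(encoded)                     # round-trips back to the original text

print(url)
print(query)
print(decoded)
```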
urllib3

The requests library is built on top of urllib3.

import urllib3
http = urllib3.PoolManager()
r = http.request('GET', 'https://www.jianshu.com', redirect=False)  # do not follow redirects
print(r.status)

Source: http://www.bubuko.com/infodetail-2667009.html
