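- # Note whether you are using `import urllib.request` or `from urllib import request`
- 0. urlopen()
Syntax: urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
Example 0: (in practice this function is usually called with just three parameters: url, data, and timeout)
* The data argument must be converted to a bytes object, for example with bytes(). bytes is a separate type from str: a sequence of raw bytes rather than text. Passing data also turns the HTTP request into a POST.
* response.read() returns bytes, so the result needs to be decoded with decode().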
- import socket  # needed for socket.timeout below
- import urllib.parse
- import urllib.request
- import urllib.error
- url = 'http://httpbin.org/post'
- # urlencode() returns a str; convert it to bytes before sending
- data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
- try:
-     response = urllib.request.urlopen(url, data=data, timeout=1)
- except urllib.error.URLError as e:
-     # Only timeouts are reported here; other URLErrors are silently ignored
-     if isinstance(e.reason, socket.timeout):
-         print('TIME OUT')
- else:
-     print(response.read().decode("utf-8"))
Output:
- {
- "args": {},
- "data": "",
- "files": {},
- "form": {
- "word": "hello"
- },
- "headers": {
- "Accept-Encoding": "identity",
- "Content-Length": "10",
- "Content-Type": "application/x-www-form-urlencoded",
- "Host": "httpbin.org",
- "User-Agent": "Python-urllib/3.6"
- },
- "json": null,
- "origin": "101.206.170.234, 101.206.170.234",
- "url": "https://httpbin.org/post"
- }
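The `Content-Length: 10` header in the output above follows directly from the encoded body. A minimal offline sketch (nothing is sent over the network) of what the `data` argument contains; the variable names `form` and `encoded` are only illustrative:

```python
import urllib.parse

# Same form data as in the example above.
form = {'word': 'hello'}

# urlencode() returns a str; urlopen() needs bytes for its data argument.
encoded = urllib.parse.urlencode(form)
data = encoded.encode('utf-8')  # same result as bytes(encoded, encoding='utf8')

print(type(encoded).__name__, encoded)
print(type(data).__name__, data, len(data))
```

The body is the 10-byte string `word=hello`, which is where the Content-Length value comes from.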
Example 1: inspect the status code, the response headers, and the Server field of the response headers
- import urllib.request
- response = urllib.request.urlopen('https://www.python.org')
- print(response.status)
- print(response.getheaders())
- print(response.getheader('Server'))
Output:
- 200
- [('Server', 'nginx'), ('Content-Type', 'text/html; charset=utf-8'), ('X-Frame-Options', 'DENY'), ('Via', '1.1 vegur'), ('Via', '1.1 varnish'), ('Content-Length', '48410'), ('Accept-Ranges', 'bytes'), ('Date', 'Tue, 09 Apr 2019 02:32:34 GMT'), ('Via', '1.1 varnish'), ('Age', '722'), ('Connection', 'close'), ('X-Served-By', 'cache-iad2126-IAD, cache-hnd18751-HND'), ('X-Cache', 'MISS, HIT'), ('X-Cache-Hits', '0, 1223'), ('X-Timer', 'S1554777154.210361,VS0,VE0'), ('Vary', 'Cookie'), ('Strict-Transport-Security', 'max-age=63072000; includeSubDomains')]
- nginx
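Note that getheaders() returns a list of (name, value) tuples, and a header such as Via can appear more than once. A small offline sketch of how one might group repeated headers, using a few pairs copied from the output above (the `grouped` name is an assumption for illustration only):

```python
from collections import defaultdict

# A few (name, value) pairs copied from the getheaders() output above.
headers = [
    ('Server', 'nginx'),
    ('Via', '1.1 vegur'),
    ('Via', '1.1 varnish'),
    ('Content-Length', '48410'),
    ('Via', '1.1 varnish'),
]

# HTTP header names are case-insensitive, so normalise them to lower case
# and keep every value instead of only the last one.
grouped = defaultdict(list)
for name, value in headers:
    grouped[name.lower()].append(value)

print(grouped['via'])
# A plain dict() keeps only the last value for a repeated name.
print(dict(headers)['Via'])
```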
Calling urlopen() with a bare URL is quite limited: for example, you cannot set request headers. This is why the Request class is needed.
1. Request()
Example 0: (the two approaches below have the same effect)
- import urllib.request
- response = urllib.request.urlopen('https://www.python.org')
- print(response.read().decode('utf-8'))
- ######################################
- import urllib.request
- req = urllib.request.Request('https://python.org')
- response = urllib.request.urlopen(req)
- print(response.read().decode('utf-8'))
The following examples show how to use Request to make GET and POST requests and to set their parameters.
Example 1: (POST request)
- from urllib import request, parse
- url = 'http://httpbin.org/post'
- headers = {
-     'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
-     'Host': 'httpbin.org'
- }
- # Named form rather than dict, to avoid shadowing the built-in dict type
- form = {
-     'name': 'Germey'
- }
- data = bytes(parse.urlencode(form), encoding='utf8')
- req = request.Request(url=url, data=data, headers=headers, method='POST')
- response = request.urlopen(req)
- print(response.read().decode('utf-8'))
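Note: headers can also be changed through build_opener(); that is not covered here, since the approach below is enough for most cases.
You can also add headers with add_header() to imitate a browser, and the data argument can be built as follows: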
- from urllib import request, parse
- url = 'http://httpbin.org/post'
- # Named form rather than dict, to avoid shadowing the built-in dict type
- form = {
-     'name': 'Germey'
- }
- # str.encode() is equivalent to bytes(..., encoding='utf8')
- data = parse.urlencode(form).encode('utf-8')
- req = request.Request(url=url, data=data, method='POST')
- req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')
- response = request.urlopen(req)
- print(response.read().decode('utf-8'))
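A Request object can be inspected before it is sent, which is a convenient way to check the method and headers without touching the network. The sketch below is only an illustration: `get_req` and `post_req` are made-up names, and urlopen() is never called.

```python
from urllib import request, parse

url = 'http://httpbin.org/post'
data = parse.urlencode({'name': 'Germey'}).encode('utf-8')

# No data and no explicit method: the request defaults to GET.
get_req = request.Request(url)

# Supplying data switches the default method to POST.
post_req = request.Request(url, data=data)
post_req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')

print(get_req.get_method())
print(post_req.get_method())

# add_header() stores the name capitalised as 'User-agent',
# so that is the spelling get_header() expects.
print(post_req.header_items())
print(post_req.get_header('User-agent'))
```

This also shows why `method='POST'` in the examples above is explicit rather than strictly required: once data is supplied, POST is already the default.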
Example 2: (GET request) a Baidu keyword search
- from urllib import request, parse
- url = 'http://www.baidu.com/s?wd='
- key = '路飞'
- # quote() lives in urllib.parse; it percent-encodes the keyword
- key_code = parse.quote(key)
- url_all = url + key_code
- """
- # Alternative: let urlencode() build the query string
- url = 'http://www.baidu.com/s'
- key = '路飞'
- wd = parse.urlencode({'wd': key})
- url_all = url + '?' + wd
- """
- req = request.Request(url_all)
- response = request.urlopen(req)
- print(response.read().decode('utf-8'))
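The two ways of building the query URL shown above produce the same string for this keyword. A short offline check, as a sketch (`url_a` and `url_b` are illustrative names, and no request is sent):

```python
from urllib import parse

key = '路飞'

# Variant 1: percent-encode only the value and append it by hand.
url_a = 'http://www.baidu.com/s?wd=' + parse.quote(key)

# Variant 2: let urlencode() build the whole key=value pair.
url_b = 'http://www.baidu.com/s' + '?' + parse.urlencode({'wd': key})

print(url_a)
print(url_a == url_b)

# unquote() reverses the percent-encoding.
print(parse.unquote(url_a))

# The two helpers differ for spaces: quote() emits %20, urlencode() emits +.
print(parse.quote('a b'), parse.urlencode({'wd': 'a b'}))
```

For keywords that may contain spaces or several parameters, urlencode() is usually the less error-prone choice, since it handles the separators as well.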
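At this point decode(), quote(), and urlencode() may raise some questions, so here are a few notes:
parse.quote: percent-encodes a str so that it is safe to put in a URL (request.quote also works, but only because urllib.request imports the name from urllib.parse)
parse.urlencode: converts a dict of k: v pairs into a query string of the form k=v, with keys and values percent-encoded and pairs joined by &
parse.unquote: converts percent-encoded data back to its original form
decode: converts bytes to str; decode("utf-8") pairs naturally with read()
encode: converts str to bytes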
- >>> str0 = '我爱你'
- >>> str1 = str0.encode('gb2312')
- >>> str1
- b'\xce\xd2\xb0\xae\xc4\xe3'
- >>> str2 = str0.encode('gbk')
- >>> str2
- b'\xce\xd2\xb0\xae\xc4\xe3'
- >>> str3 = str0.encode('utf-8')
- >>> str3
- b'\xe6\x88\x91\xe7\x88\xb1\xe4\xbd\xa0'
- >>> str00 = str1.decode('gb2312')
- >>> str00
- '我爱你'
- >>> str11 = str1.decode('utf-8') # raises an error, because str1 is GB2312-encoded
- Traceback (most recent call last):
- File "<pyshell#9>", line 1, in <module>
- str11 = str1.decode('utf-8')
- UnicodeDecodeError: 'utf-8' codec can't decode byte 0xce in position 0: invalid continuation byte
* The encoding argument specifies which character encoding to use.
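When the encoding of a response is not known in advance, one common workaround is to try a short list of candidates in order. The helper below, `decode_with_fallback`, is a hypothetical name and only a sketch of that idea, not part of urllib:

```python
def decode_with_fallback(raw, encodings=('utf-8', 'gbk')):
    """Try each encoding in order and return the first successful decode."""
    for name in encodings:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue
    # Last resort: decode anyway, replacing undecodable bytes with U+FFFD.
    return raw.decode(encodings[0], errors='replace')

gb_bytes = '我爱你'.encode('gb2312')
utf_bytes = '我爱你'.encode('utf-8')

print(decode_with_fallback(gb_bytes))
print(decode_with_fallback(utf_bytes))
```

The order matters: UTF-8 is strict and rejects most non-UTF-8 input, whereas a GBK decoder may accept UTF-8 bytes and silently produce garbage, so UTF-8 is tried first.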
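Another question: what is the difference between read(), readline(), and readlines()?
read(): the whole content as a single object (str for a text-mode file, bytes for an HTTP response)
readline(): a single line
readlines(): the whole content as a list of lines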
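The difference can be seen without a network connection by using io.BytesIO as a stand-in for a response body; like an HTTP response, it is a binary stream. The `body` content below is made up for illustration:

```python
import io

# Stand-in for a response body (illustrative content only).
body = b'line one\nline two\nline three\n'

# read(): everything at once, as a single bytes object.
print(io.BytesIO(body).read())

# readline(): only the first line, including its trailing newline.
print(io.BytesIO(body).readline())

# readlines(): everything, split into a list of lines.
print(io.BytesIO(body).readlines())

# On a text stream the same methods return str instead of bytes.
text = io.StringIO(body.decode('utf-8'))
print(type(text.read()).__name__)
```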
Source: https://www.cnblogs.com/DC0307/p/10675878.html