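What it does:
Zhihu is fairly crawler-friendly as websites go, but!
The login captcha is now really annoying: you have to click the upside-down characters in the image! How are those of us who barely know many characters to begin with supposed to cope /(ㄒ o ㄒ)/~~. So this script swaps a URL parameter to request a different kind of login captcha, a text one that has to be typed in by hand.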
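The parameter swap comes down to requesting the captcha image with `type=login` and a millisecond timestamp in `r`. Here is a minimal sketch of just that step. `build_captcha_url` is a hypothetical helper name, and whether the endpoint still answers this way is an assumption taken from the script, not something checked here.

```python
import time

# Endpoint as it appears in the script; its current behaviour is not verified.
CAPTCHA_ENDPOINT = "https://www.zhihu.com/captcha.gif"


def build_captcha_url(now=None, captcha_type="login"):
    """Build the captcha URL for a Unix time in seconds (defaults to the current time)."""
    if now is None:
        now = time.time()
    millis = int(now * 1000)  # the r parameter is a millisecond timestamp
    return "%s?r=%d&type=%s" % (CAPTCHA_ENDPOINT, millis, captcha_type)


print(build_captcha_url(1512000000.0))
# prints https://www.zhihu.com/captcha.gif?r=1512000000000&type=login
```

Passing `now` explicitly makes the URL reproducible, which helps when comparing it against a captured request.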
# Python 3; requires the requests, beautifulsoup4 and lxml packages.
import time

import requests
from bs4 import BeautifulSoup

url = "https://www.zhihu.com/#signin"
email = "your_email"   # placeholder: replace with a valid account email
pwd = "your_password"  # placeholder: replace with the account password
requests.adapters.DEFAULT_RETRIES = 511


# Save the captcha image to disk and ask the user to type what it shows.
def captcha(captcha_data):
    with open("captcha.jpg", "wb") as f:
        f.write(captcha_data)
    text = input("Enter the captcha: ")
    return text


# Packet capture shows that the captcha image URL is built from a Unix
# timestamp, so we can fetch the four-character alphanumeric login
# captcha ourselves.
def zhihulogin():
    while True:
        try:
            sess = requests.Session()
            sess.keep_alive = False
            headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36"}
            html = sess.get(url, headers=headers).text
            bs = BeautifulSoup(html, "lxml")
            # The hidden _xsrf form field must be sent back with the login request.
            _xsrf = bs.find("input", attrs={"name": "_xsrf"}).get("value")
            captcha_url = "https://www.zhihu.com/captcha.gif?r=%d&type=login" % (time.time() * 1000)
            captcha_data = sess.get(captcha_url, headers=headers).content
            text = captcha(captcha_data)
            data = {
                "_xsrf": _xsrf,
                "email": email,
                "password": pwd,
                "captcha": text,
            }
            response = sess.post("https://www.zhihu.com/login/email", data=data, headers=headers)
            print(response.text)
            print(captcha_url)
            print(_xsrf)
            break
        except Exception:
            # Without the sleep here, an SSL max-connections error seems to
            # be raised; I forgot the exact error log (lll¬ω¬)
            time.sleep(5)


if __name__ == "__main__":
    zhihulogin()
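The `_xsrf` lookup in the script assumes the hidden input is always present; if the page layout differs, `find` returns nothing and the attempt fails inside the retry loop. The sketch below isolates that step on a made-up HTML sample. It uses the standard library's `HTMLParser` in place of BeautifulSoup so it runs without extra packages; `HiddenInputFinder` and `find_xsrf` are hypothetical names.

```python
from html.parser import HTMLParser


class HiddenInputFinder(HTMLParser):
    """Remember the value of the first <input> whose name attribute matches."""

    def __init__(self, name):
        super().__init__()
        self.name = name
        self.value = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("name") == self.name and self.value is None:
            self.value = attrs.get("value")


def find_xsrf(html):
    """Return the _xsrf token found in the page, or None when it is missing."""
    finder = HiddenInputFinder("_xsrf")
    finder.feed(html)
    return finder.value


# Made-up sample markup, not a copy of the real sign-in page.
SAMPLE = '<form><input type="hidden" name="_xsrf" value="abc123"/></form>'
print(find_xsrf(SAMPLE))           # prints abc123
print(find_xsrf("<html></html>"))  # prints None
```

Returning `None` lets the caller report a missing token explicitly instead of retrying silently.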
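Before running, replace the username and password in the script with valid ones. Then run the script, open the saved image, type in the captcha, and you are logged in to Zhihu.
Source: http://www.jianshu.com/p/4d39f7b5db28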
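The script retries forever, sleeping five seconds after any failure, so a wrong selector or a changed endpoint would loop without ever reporting an error. One possible variation is a bounded retry that re-raises the last error. This is a sketch under that assumption, not the script's actual behaviour; `retry` and `flaky` are hypothetical names, and `flaky` only simulates a failing login.

```python
import time


def retry(action, attempts=3, delay=5.0, sleep=time.sleep):
    """Call action() until it succeeds, at most `attempts` times, pausing in between."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as exc:
            last_error = exc
            if attempt < attempts:
                sleep(delay)  # back off before the next attempt
    raise last_error


# Simulated login that fails twice before succeeding.
calls = []


def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise IOError("simulated connection error")
    return "logged in"


print(retry(flaky, attempts=5, delay=0))  # prints logged in
```

With the script, the call would look like `retry(zhihulogin, attempts=3)` once the `while True` loop and its `try`/`except` are removed from `zhihulogin`.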