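I've been learning XPath recently. While looking for material online, I noticed a project that beginners often use for practice: scraping the Maoyan Top 100 movie board. Many of the write-ups are nearly identical to Cui Qingcai's (崔庆才) version, basically copied. So I wrote my own program with XPath instead. It scrapes the same Maoyan board and collects the same fields, just with a different approach.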
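Honestly, for matching content in web pages I'd still recommend XPath. Regular expressions can get the same result, but the patterns become too verbose, and one small slip means nothing matches. That's especially rough on beginners who aren't comfortable with regex yet: they can't even find what went wrong, and it's easy to give up. I mostly use regex for processing files, where it's a fantastic tool.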
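To make the comparison concrete, here is a small sketch that pulls the same movie titles out of a made-up HTML fragment, once with a path expression and once with a regular expression. It is an illustration only: the fragment is invented, and it uses the standard library's `xml.etree.ElementTree` (which supports just a limited XPath subset, with no `text()`, so the text is read from `.text`) instead of lxml, so that it runs without extra installs.

```python
import re
import xml.etree.ElementTree as ET

# Invented, well-formed fragment that mimics a list of movie entries.
FRAGMENT = """
<dl>
  <dd><p class="name"><a href="/films/1">Movie A</a></p></dd>
  <dd><p class="name"><a href="/films/2">Movie B</a></p></dd>
</dl>
""".strip()

PATH = ".//p[@class='name']/a"
# The regex has to spell out the surrounding markup character by character.
PATTERN = re.compile(r'<p class="name"><a href="[^"]*">(.*?)</a></p>')

# Both approaches find the titles in the original fragment.
names_by_path = [a.text for a in ET.fromstring(FRAGMENT).findall(PATH)]
names_by_regex = PATTERN.findall(FRAGMENT)
print(names_by_path)   # ['Movie A', 'Movie B']
print(names_by_regex)  # ['Movie A', 'Movie B']

# Simulate a small markup change: an extra attribute on the <p> tag.
changed = FRAGMENT.replace('<p class="name">', '<p class="name" title="t">')

path_after_change = [a.text for a in ET.fromstring(changed).findall(PATH)]
regex_after_change = PATTERN.findall(changed)
print(path_after_change)   # ['Movie A', 'Movie B']
print(regex_after_change)  # []
```

The path expression only describes where the node sits in the tree, so the extra attribute doesn't affect it, while the regex silently returns nothing. That is the "one small slip and nothing matches" problem in miniature.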
Here's the code.
    import csv
    import re

    import requests
    from lxml import etree
    from requests.exceptions import RequestException


    def get_page(url):
        """
        Fetch the HTML source of a page.
        :param url: URL of the page to fetch
        :return: the page text, or None if the request fails
        """
        try:
            headers = {
                'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) '
                              'Chrome/76.0.3809.100 Safari/537.36',
            }
            # A timeout keeps the script from hanging on a stalled connection.
            response = requests.get(url, headers=headers, timeout=10)
            if response.status_code == 200:
                return response.text
            return None
        except RequestException:
            return None


    def parse_page(text):
        """
        Parse the HTML source of one board page.
        :param text: HTML source
        :return: an iterator of (ranking, name, actors, release time, score) tuples
        """
        html = etree.HTML(text)
        movie_name = html.xpath("//p[@class='name']/a/text()")
        actor = html.xpath("//p[@class='star']/text()")
        # Strip the whitespace and newlines around the actor text.
        actor = [re.sub(r'\s+', '', item) for item in actor]
        release_time = html.xpath("//p[@class='releasetime']/text()")
        # The score is split across two <i> tags; join the two parts.
        grade1 = html.xpath("//p[@class='score']/i[@class='integer']/text()")
        grade2 = html.xpath("//p[@class='score']/i[@class='fraction']/text()")
        score = [a + b for a, b in zip(grade1, grade2)]
        ranking = html.xpath("//dd/i/text()")
        return zip(ranking, movie_name, actor, release_time, score)


    def change_page(number):
        """
        Build the URL for one page of the board.
        :param number: offset of the first movie on the page (0, 10, ..., 90)
        :return: the page URL
        """
        base_url = 'https://maoyan.com/board/4'
        url = base_url + '?offset=%s' % number
        return url


    def save_to_csv(result, filename):
        """
        Append one row to a CSV file.
        :param result: one row of movie data
        :param filename: path of the CSV file
        :return:
        """
        # newline='' lets the csv module control line endings (avoids blank rows on Windows).
        with open(filename, 'a', newline='', encoding='utf-8') as csvfile:
            writer = csv.writer(csvfile, dialect='excel')
            writer.writerow(result)


    def main():
        """
        Entry point: walk the ten pages of the board and save every row.
        :return:
        """
        for i in range(0, 100, 10):
            url = change_page(i)
            text = get_page(url)
            if text is None:
                # The request failed or returned a non-200 status; skip this page.
                continue
            for row in parse_page(text):
                save_to_csv(row, filename='message.csv')


    if __name__ == '__main__':
        main()
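The script above reopens `message.csv` for every row and writes no header. One possible alternative, sketched here under the assumption that the rows are first collected into a list, is to open the file once, write a header, and then write all rows with `writerows`. The helper name `save_rows`, the column names, and the sample rows are all hypothetical. The sketch writes to an in-memory buffer so it can be run without touching the disk.

```python
import csv
import io

# Hypothetical column names; the original script writes no header row.
HEADER = ['ranking', 'name', 'actors', 'release_time', 'score']


def save_rows(rows, csvfile):
    """Write a header plus all rows to an already-open text file object."""
    writer = csv.writer(csvfile, dialect='excel')
    writer.writerow(HEADER)
    writer.writerows(rows)


# Made-up sample rows shaped like the tuples parse_page yields.
sample_rows = [
    ('1', 'Movie A', 'Actors:X,Y', 'Release:1993-01-01', '9.5'),
    ('2', 'Movie B', 'Actors:Z', 'Release:1994-09-10', '9.4'),
]

buffer = io.StringIO()
save_rows(sample_rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)

# Read the text back to check that the comma inside a field survived.
parsed = list(csv.reader(csv_text.splitlines()))
```

To write to disk instead, the same helper can be called inside `with open('message.csv', 'w', newline='', encoding='utf-8') as f:`, passing `f` as the second argument. Note that this uses mode `'w'`, so it overwrites the file on each run rather than appending to it as the original does.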
Source: https://www.cnblogs.com/lattesea/p/11463236.html