My First Python Crawler

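A friend asked me for help today with collecting supplier information from a commercial website. It was urgent, so I put my own work aside and started writing a crawler in Python 3. This was actually my first real Python 3 program, and also my first crawler. Luckily, it went more smoothly than I expected.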

Here is the main crawler code, lightly edited and abridged:

#!/usr/bin/env python3
import requests
import json
from lxml import etree
# URL redacted
url_main = 'https://xxxx.com/data/ajax/get_offer_list.json?beginpage='
url_params = '&keywords=%XX%XX%XX%XX%XX%XX%XX&sortType=&descendOrder=....'
page_count = 50  # The source data is paginated, and an out-of-range page number still returns the last page, so just hard-code the total page count
for pi in range(page_count):
  url_req = url_main + str(pi) + url_params
  resp = requests.get(url_req).content.decode('utf-8')
  idx_from = resp.find('(') + 1
  json_obj = resp[idx_from:-1]  # The JSON object sits inside parentheses: keep what follows the first '(' and drop the final character
  hjson = json.loads(json_obj)
  count = hjson['data']['offerCount']  # Number of entries on this page
  for i in range(count):
    idx = pi*20+i+1  # Give every entry a sequence number
    cname = hjson['data']['content']['offerResult'][i]['attr']['company']['name']  # Walk the JSON down to the supplier name
    eurl = hjson['data']['content']['offerResult'][i]['eurl']   # URL of the supplier page
    if eurl.startswith('//'):
      eurl = "https:" + eurl  # Some links are protocol-relative and need a scheme
    res_detail = requests.get(eurl)  # Fetch the supplier detail page
    res_detail = res_detail.content.decode('GBK')  # The page is GBK-encoded and has to be decoded
    selector = etree.HTML(res_detail)  # Parse the HTML into an element tree
    names = selector.xpath('//a[@class="membername"]')  # Find the contact name
    name = ''
    if len(names) > 0:
      name = names[0].text or ''  # .text is None when the element has no text
    else:
      names = selector.xpath('//span[@class="disc"]/a')  # The contact is sometimes stored elsewhere
      if len(names) > 0 and (names[0].text or '').strip():
        name = names[0].text
      else:
        names = selector.xpath('//div[@class="contactSeller"]/a')  # Or it may be here instead
        if len(names) > 0 and (names[0].text or '').strip():
          name = names[0].text
    cells = selector.xpath('//dl[@class="m-mobilephone"]/@data-no')  # Find the contact phone number
    cell = ''
    if len(cells) > 0:
      cell = cells[0]
    print(idx, cname, name, cell)  # Done, print the collected fields
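
The slice above keeps everything after the first '(' and drops the last character, which only works when the body ends exactly with ')'. A slightly more tolerant way to unwrap such a response might look like the sketch below. The helper name `strip_jsonp` and the sample payload are made up for illustration; they are not taken from the real site.

```python
import json


def strip_jsonp(body):
    """Parse a JSON payload that may be wrapped as callback(...).

    Hypothetical helper: uses the text between the first '(' and the
    last ')', so a trailing ';' or newline after the wrapper is tolerated.
    """
    text = body.strip()
    if text.startswith(('{', '[')):
        # Already plain JSON, nothing to unwrap.
        return json.loads(text)
    start = text.find('(')
    end = text.rfind(')')
    if start == -1 or end <= start:
        raise ValueError('response is neither JSON nor a callback(...) wrapper')
    return json.loads(text[start + 1:end])


# Made-up sample, shaped loosely like the response used above.
sample = 'callback({"data": {"offerCount": 2}});\n'
payload = strip_jsonp(sample)
print(payload['data']['offerCount'])
```

In the crawler this would stand in for the `find` / slice / `json.loads` lines, with `resp` passed in as `body`.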

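For convenience, the crawler's output is saved to test.txt. Looking through the file, I found that the supplier data is not filled in consistently: some suppliers left out the phone number, and others embedded it in an image. To keep things simple, anything that does not conform is ignored.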
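
One way to express "ignore anything that does not conform" is a small parser that returns None for bad lines. This is only a sketch: it assumes each output line is `index company contact phone` separated by whitespace and that a usable phone number is all digits. The name `parse_line` and the sample lines are made up.

```python
def parse_line(line):
    """Return (company, contact, phone) for a well-formed line, else None."""
    segs = line.split()
    if len(segs) < 4:
        # Missing contact name or phone number.
        return None
    phone = segs[3]
    if not phone.isdigit():
        # For example a placeholder where the number was only in an image.
        return None
    return segs[1], segs[2], phone


# Made-up sample lines in the same shape as the crawler's print() output.
lines = [
    '1 AcmeCo Zhang 13800000000',
    '2 OtherCo Li',
    '3 ThirdCo Wang see-image',
]
records = [r for r in map(parse_line, lines) if r is not None]
print(records)
```

Splitting on whitespace is fragile when a company name itself contains a space, since the fields then shift; printing tab-separated fields from the crawler would be sturdier.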

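As it happened, my friend had earlier given me an Excel address book, which I had converted to CSV after a little cleanup. It made sense to merge it with the crawler results and generate a VCF file directly, which is also more convenient to use. So I got straight to it: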

#!/usr/bin/env python3
# Contact dictionary
mans = {}
# The crawler output file
txtFile = open('test.txt', 'r')
for line in txtFile:
  segs = line.split()
  if len(segs) < 4 or not segs[3].isdigit():  # Ignore anything malformed
    continue
  mobile = int(segs[3])  # Key the dictionary by phone number to avoid duplicates
  if mobile not in mans:
    mans[mobile] = segs[1] + ' ' + segs[2]
txtFile.close()
# The existing CSV address book
csvFile = open('test.csv', 'r')
for line in csvFile:
  segs = line.split(',')
  if len(segs) < 3:  # Skip rows with missing columns
    continue
  segs[2] = segs[2].strip()  # The file was generated on Windows; drop the trailing line ending
  if not segs[2].isdigit():  # Ignore malformed numbers
    continue
  mobile = int(segs[2])
  if mobile not in mans:
    mans[mobile] = segs[1] + ' ' + segs[0]
csvFile.close()
# The VCF file to write
vcfFile = open('final.vcf', 'w')
# Emit one vCard entry per contact
for mobile, addr_name in mans.items():
  idx = addr_name.find(' ')
  addr = addr_name[:idx]
  name = addr_name[idx+1:]
  vcfFile.write('BEGIN:VCARD\n')
  vcfFile.write('VERSION:3.0\n')
  vcfFile.write('TEL;type=CELL;type=VOICE;type=pref:' + str(mobile) + '\n')
  vcfFile.write('FN:' + name + '\n')
  vcfFile.write('ORG:' + addr + '\n')
  vcfFile.write('END:VCARD\n')  # Every card has to be closed
vcfFile.close()
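
Every vCard entry runs from BEGIN:VCARD to a closing END:VCARD line, and vCard 3.0 also expects an N property next to FN. Building each card in one helper keeps those pieces together. The sketch below is an illustration only: `make_vcard` is a made-up name, the full name is simply placed in the first N component, and the values are assumed to contain no commas, semicolons or line breaks (which would need escaping).

```python
def make_vcard(name, org, mobile):
    """Build one vCard 3.0 entry as a string (hypothetical helper)."""
    lines = [
        'BEGIN:VCARD',
        'VERSION:3.0',
        'N:' + name + ';;;;',   # Simplification: whole name in the first component
        'FN:' + name,
        'ORG:' + org,
        'TEL;type=CELL;type=VOICE;type=pref:' + str(mobile),
        'END:VCARD',
    ]
    # The vCard format uses CRLF line endings.
    return '\r\n'.join(lines) + '\r\n'


# Made-up contact for illustration.
card = make_vcard('Zhang', 'AcmeCo', 13800000000)
print(card)
```

When writing such strings to disk, opening the file with `open('final.vcf', 'w', newline='')` keeps Python from translating the line endings a second time on Windows.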

After trying it out, my friend was delighted: what used to take at least a week, I had finished in half a day.


Source: http://www.jianshu.com/p/58e33dffafa7
