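Web Crawling with the urllib.request Module

urllib.request is one of the ways a crawler can send network requests.

To extract data from the pages we fetch, we use regular expressions: https://www.cnblogs.com/eunuch/p/9157546.html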

What we use:

re: the regular expression module (covered in another of my posts)

Request: creates the request object

urlopen: sends the request
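
As a minimal sketch of what Request does on its own: a request object can be built and inspected without touching the network. The URL and User-Agent string below are placeholders for illustration, not values from the original post.

```python
from urllib.request import Request

# Placeholder values for illustration only
url = "https://example.com/"
user_agent = "Mozilla/5.0 (illustrative placeholder)"

# Build the request object; nothing is sent at this point
request = Request(url=url, headers={"User-Agent": user_agent})

print(request.full_url)                  # the URL the request targets
print(request.get_method())              # GET, because no data was supplied
print(request.get_header("User-agent"))  # Request stores the header name as 'User-agent'
```

Passing this object to urlopen(request) is the step that actually sends it.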

Imports and the spider class:

import re
from urllib.request import Request, urlopen


class CSDNSpider(object):
    def __init__(self, url):
        self.url = url
        # Browser identifier (User-Agent); fill in a real browser string here
        self.user_agent = " "

    def get_page_code(self):
        # Create the request object
        request = Request(url=self.url, headers={'User-Agent': self.user_agent})
        # Send the request
        try:
            response = urlopen(request)
            # Get the page source as a string from the response object.
            # response.read(): returns <class 'bytes'>, the bytes type in Python 3
            # decode(): converts bytes to str
            # encode(): converts str to bytes
            data = response.read().decode()
        except Exception as e:
            # On failure, report the error; the method then returns None
            print('Request failed:', e)
        else:
            return data

    def parse_data_by_html(self, html):
        """
        Parse the HTML and extract the data
        :param html: page source
        :return: the parsed data
        """
        # re.S lets '.' match newlines as well
        pattern = re.compile(r'   ', re.S)
        res = re.findall(pattern, html)
        return res
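
The pattern in parse_data_by_html is left blank in the post, so here is a rough sketch of how the decode step and re.findall with re.S behave together. The HTML snippet and the pattern are hypothetical, made up only to show the shape of the result.

```python
import re

# Hypothetical page source as bytes, the way response.read() returns it
raw = (
    '<div class="post">\n'
    '  <h2>\n    First post\n  </h2>\n'
    '  <span class="views">120</span>\n'
    '</div>\n'
    '<div class="post">\n'
    '  <h2>\n    Second post\n  </h2>\n'
    '  <span class="views">45</span>\n'
    '</div>\n'
).encode("utf-8")

html = raw.decode("utf-8")  # bytes -> str before matching

# re.S lets '.' match newlines, so one pattern can span several lines.
# Each (...) group becomes one element of the tuples that findall returns.
pattern = re.compile(r'<h2>(.*?)</h2>.*?<span class="views">(.*?)</span>', re.S)
res = re.findall(pattern, html)
print(res)
```

With more than one group, findall returns a list of tuples, and the captured text keeps whatever whitespace and newlines sat between the tags.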
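The data in res may contain strings we don't need. Note: because the regex matches against the raw source string, the captured text can include stray characters such as surrounding spaces and newlines.
So res needs to be cleaned up.
One way to do that is to write a function that processes the data: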
class DataParserTool(object):
    @classmethod
    def parser_data(cls, data):
        """
        Clean the data
        :param data: list of tuples [(), (), ()]
        :return: [(), (), ()]
        """
        data_list = []
        for n1, n2, n3, n4, n5, n6 in data:
            n1 = n1.strip()  # strip whitespace from both ends
            n2 = n2.replace('\n', '')  # remove newlines
            data_list.append((n1, n2, n3, n4, n5, n6))
        return data_list
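
To make the effect of this cleanup concrete, here is a small sketch on made-up data. It is a generalization rather than the author's method: the hypothetical clean_rows helper applies both strip() and the newline removal to every field, whatever the tuple length.

```python
def clean_rows(rows):
    """Strip surrounding whitespace and remove newlines in every field."""
    cleaned = []
    for row in rows:
        cleaned.append(tuple(field.strip().replace("\n", "") for field in row))
    return cleaned


# Hypothetical raw matches, shaped like the output of re.findall with two groups
raw_rows = [("\n    First post\n  ", "120"), ("  Second\npost ", "45")]
cleaned_rows = clean_rows(raw_rows)
print(cleaned_rows)
```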

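With @classmethod, the method is called directly on the class: DataParserTool.parser_data(data)

Without it, you have to create an instance first and call the method on that instance: DataParserTool().parser_data(data)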
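
The difference between the two call styles can be sketched with a throwaway class. The class and method names below are hypothetical and exist only for this illustration.

```python
class Tool(object):
    @classmethod
    def from_class(cls, data):
        # cls is the class itself, so no instance is needed
        return [item.strip() for item in data]

    def from_instance(self, data):
        # self is an instance, so one has to be created first
        return [item.strip() for item in data]


sample = ["  a ", "b  "]
via_class = Tool.from_class(sample)          # called directly on the class
via_instance = Tool().from_instance(sample)  # instance created, then the method is called
print(via_class, via_instance)
```

Since parser_data never uses any per-instance state, calling it on the class saves creating an object that would serve no purpose.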

Source: http://www.bubuko.com/infodetail-2637272.html
