Life is short, I use Python.
Previous posts in this series:
Python Crawler for Beginners (1): Introduction https://www.geekdigging.com/2019/11/13/3303836941/
Python Crawler for Beginners (2): Prerequisites (Part 1): Installing the Basic Libraries https://www.geekdigging.com/2019/11/20/2586166930/
Python Crawler for Beginners (3): Prerequisites (Part 2): Linux Basics https://www.geekdigging.com/2019/11/21/1005563697/
Python Crawler for Beginners (4): Prerequisites (Part 3): Docker Basics https://www.geekdigging.com/2019/11/22/3679472340/
Python Crawler for Beginners (5): Prerequisites (Part 4): Database Basics https://www.geekdigging.com/2019/11/24/334078215/
Python Crawler for Beginners (6): Prerequisites (Part 5): Installing the Crawler Frameworks https://www.geekdigging.com/2019/11/25/1881661601/
Python Crawler for Beginners (7): HTTP Basics https://www.geekdigging.com/2019/11/26/1197821400/
Python Crawler for Beginners (8): Web Page Basics https://www.geekdigging.com/2019/11/27/101847406/
Python Crawler for Beginners (9): Crawler Basics https://www.geekdigging.com/2019/11/28/1668465912/
Python Crawler for Beginners (10): Sessions and Cookies https://www.geekdigging.com/2019/12/01/2475257648/
Python Crawler for Beginners (11): Basic Usage of urllib (Part 1) https://www.geekdigging.com/2019/12/02/2333822325/
Python Crawler for Beginners (12): Basic Usage of urllib (Part 2) https://www.geekdigging.com/2019/12/03/819896244/
Python Crawler for Beginners (13): Basic Usage of urllib (Part 3) https://www.geekdigging.com/2019/12/04/2992515886/
Python Crawler for Beginners (14): Basic Usage of urllib (Part 4) https://www.geekdigging.com/2019/12/05/104488944/
Python Crawler for Beginners (15): Basic Usage of urllib (Part 5) https://www.geekdigging.com/2019/12/07/2788855167/
Python Crawler for Beginners (16): urllib in Practice: Scraping Images from Meizitu https://www.geekdigging.com/2019/12/09/1691033431/
Python Crawler for Beginners (17): Basic Usage of Requests https://www.geekdigging.com/2019/12/10/1910005577/
Python Crawler for Beginners (18): Advanced Operations with Requests https://www.geekdigging.com/2019/12/11/1468953802/
Python Crawler for Beginners (19): XPath Basics https://www.geekdigging.com/2019/12/12/3568648672/
Python Crawler for Beginners (20): Advanced XPath https://www.geekdigging.com/2019/12/13/2569867940/
Python Crawler for Beginners (21): The Beautiful Soup Parsing Library (Part 1) https://www.geekdigging.com/2019/12/15/2789385418/
Python Crawler for Beginners (22): The Beautiful Soup Parsing Library (Part 2) https://www.geekdigging.com/2019/12/16/876770087/
Python Crawler for Beginners (23): Getting Started with the pyquery Parsing Library https://www.geekdigging.com/2019/12/17/876770088/
Python Crawler for Beginners (24): 2019 Douban Movie Rankings https://www.geekdigging.com/2019/12/18/1275791678/
Python Crawler for Beginners (25): Scraping Stock Information https://www.geekdigging.com/2019/12/19/1066903974/
Python Crawler for Beginners (26): Why You Can't Afford a Second-Hand Apartment in Shanghai https://www.geekdigging.com/2019/12/20/788803015/
Python Crawler for Beginners (27): The Selenium Automated Testing Framework, from Beginner to Giving Up (Part 1) https://www.geekdigging.com/2019/12/22/151891020/
Python Crawler for Beginners (28): The Selenium Automated Testing Framework, from Beginner to Giving Up (Part 2) https://www.geekdigging.com/2019/12/24/1100772905/
Python Crawler for Beginners (29): Fetching Product Information from a Major E-Commerce Site with Selenium https://www.geekdigging.com/2019/12/25/7469407721/
Python Crawler for Beginners (30): Proxy Basics https://www.geekdigging.com/2019/12/26/9565104888/
Python Crawler for Beginners (31): Building a Simple Proxy Pool of Your Own https://www.geekdigging.com/2019/12/30/8571162753/
Python Crawler for Beginners (32): Getting Started with the Asynchronous Request Library AIOHTTP https://www.geekdigging.com/2019/12/31/5591013128/
Python Crawler for Beginners (33): Scrapy Crawler Framework Basics (Part 1) https://www.geekdigging.com/2020/01/05/4332169375/
Python Crawler for Beginners (34): Scrapy Crawler Framework Basics (Part 2) https://www.geekdigging.com/2020/01/06/5952331367/
Python Crawler for Beginners (35): Scrapy Crawler Framework Basics (Part 3): Selectors https://www.geekdigging.com/2020/01/07/5324623576/
Python Crawler for Beginners (36): Scrapy Crawler Framework Basics (Part 4): Downloader Middleware https://www.geekdigging.com/2020/01/08/8971187302/
Python Crawler for Beginners (37): Scrapy Crawler Framework Basics (Part 5): Spider Middleware https://www.geekdigging.com/2020/01/09/4676268033/
Python Crawler for Beginners (38): Scrapy Crawler Framework Basics (Part 6): Item Pipeline https://www.geekdigging.com/2020/01/10/2131379794/
Python Crawler for Beginners (39): Getting Started with the JavaScript Rendering Service Scrapy-Splash https://www.geekdigging.com/2020/01/11/4748644439/
Python Crawler for Beginners (40): Scrapy Crawler Framework Basics (Part 7): Integrating Selenium in Practice https://www.geekdigging.com/2020/01/13/546552645/
Python Crawler for Beginners (41): Scrapy Crawler Framework Basics (Part 8): Integrating Splash in Practice https://www.geekdigging.com/2020/01/14/7203139876/
The complete code:

```python
import requests
from pyquery import PyQuery
import xlsxwriter

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36',
    # NOTE: this cookie was captured from a live browser session and will
    # expire; replace it with a fresh one before running.
    'cookie': '__jsluid_s=6fc5b4a3b5235afbfdafff4bbf7e6dbd; PHPSESSID=v9hm8hc3s56ogrn8si12fejdm3; mfw_uuid=5e1db855-ab4a-da12-309c-afb9cf90d3dd; _r=baidu; _rp=a%3A2%3A%7Bs%3A1%3A%22p%22%3Bs%3A18%3A%22www.baidu.com%2Flink%22%3Bs%3A1%3A%22t%22%3Bi%3A1579006045%3B%7D; oad_n=a%3A5%3A%7Bs%3A5%3A%22refer%22%3Bs%3A21%3A%22https%3A%2F%2Fwww.baidu.com%22%3Bs%3A2%3A%22hp%22%3Bs%3A13%3A%22www.baidu.com%22%3Bs%3A3%3A%22oid%22%3Bi%3A1026%3Bs%3A2%3A%22dm%22%3Bs%3A15%3A%22www.mafengwo.cn%22%3Bs%3A2%3A%22ft%22%3Bs%3A19%3A%222020-01-14+20%3A47%3A25%22%3B%7D; __mfwothchid=referrer%7Cwww.baidu.com; __omc_chl=; __mfwc=referrer%7Cwww.baidu.com; Hm_lvt_8288b2ed37e5bc9b4c9f7008798d2de0=1579006048; uva=s%3A264%3A%22a%3A4%3A%7Bs%3A13%3A%22host_pre_time%22%3Bs%3A10%3A%222020-01-14%22%3Bs%3A2%3A%22lt%22%3Bi%3A1579006046%3Bs%3A10%3A%22last_refer%22%3Bs%3A137%3A%22https%3A%2F%2Fwww.baidu.com%2Flink%3Furl%3DuR5Oj9n_xm4TSj7_1drQ1HRnFTYNM0M2TCljkjVrdIiUE-B2qPgh0MifEkceLE_U%26wd%3D%26eqid%3D93c920a80002dc72000000035e1db85c%22%3Bs%3A5%3A%22rhost%22%3Bs%3A13%3A%22www.baidu.com%22%3B%7D%22%3B; __mfwurd=a%3A3%3A%7Bs%3A6%3A%22f_time%22%3Bi%3A1579006046%3Bs%3A9%3A%22f_rdomain%22%3Bs%3A13%3A%22www.baidu.com%22%3Bs%3A6%3A%22f_host%22%3Bs%3A3%3A%22www%22%3B%7D; __mfwuuid=5e1db855-ab4a-da12-309c-afb9cf90d3dd; UM_distinctid=16fa418373e40f-070db24dfac29d-c383f64-1fa400-16fa418373fe31; __jsluid_h=b3f11fd3c79469af5c49be9ecb7f7b86; __omc_r=; __mfwa=1579006047379.58159.3.1579011903001.1579015057723; __mfwlv=1579015057; __mfwvn=2; CNZZDATA30065558=cnzz_eid%3D448020855-1579003717-https%253A%252F%252Fwww.baidu.com%252F%26ntime%3D1579014923; bottom_ad_status=0; __mfwb=5e663dbc8869.7.direct; __mfwlt=1579019025; Hm_lpvt_8288b2ed37e5bc9b4c9f7008798d2de0=1579019026; __jsl_clearance=1579019146.235|0|fpZQ1rm7BHtgd6GdjVUIX8FJJ9o%3D'
}

s = requests.Session()
value = []

def getList(maxNum):
    """
    Fetch data from the list pages.
    :param maxNum: maximum number of pages to crawl
    :return:
    """
    url = 'http://www.mafengwo.cn/gonglve/'
    s.get(url, headers=headers)
    for page in range(1, maxNum + 1):
        data = {'page': page}
        # The list is paginated via a POST with a 'page' parameter
        response = s.post(url, data=data, headers=headers)
        doc = PyQuery(response.text)
        items = doc('.feed-item').items()
        for item in items:
            if item('.type strong').text() == '游记':
                # Only items tagged "游记" (travel note) get their
                # detail pages crawled
                inner_url = item('a').attr('href')
                getInfo(inner_url)

def getInfo(url):
    """
    Fetch data from a detail page.
    :param url: detail page link
    :return:
    """
    response = s.get(url, headers=headers)
    doc = PyQuery(response.text)
    title = doc('title').text()
    # Locate the block that holds the trip metadata
    item = doc('.tarvel_dir_list')
    if len(item) == 0:
        return
    time = item('.time').text()
    day = item('.day').text()
    people = item('.people').text()
    cost = item('.cost').text()
    # Normalize the raw 'label/value' strings, keeping only the value part
    if time != '':
        time = time.split('/')[1] if len(time.split('/')) > 1 else ''
    if day != '':
        day = day.split('/')[1] if len(day.split('/')) > 1 else ''
    if people != '':
        people = people.split('/')[1] if len(people.split('/')) > 1 else ''
    if cost != '':
        cost = cost.split('/')[1] if len(cost.split('/')) > 1 else ''
    value.append([title, time, day, people, cost, url])

def write_excel_xlsx(value):
    """
    Write the collected rows into an Excel file.
    :param value: list of rows collected by getInfo()
    :return:
    """
    index = len(value)
    workbook = xlsxwriter.Workbook('mfw.xlsx')
    sheet = workbook.add_worksheet()
    for i in range(1, index + 1):
        row = 'A' + str(i)  # A1-style reference for the start of row i
        sheet.write_row(row, value[i - 1])
    workbook.close()
    print("Data written to the xlsx spreadsheet successfully!")

def main():
    getList(5)
    write_excel_xlsx(value)

if __name__ == '__main__':
    main()
```
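As an aside, the four cleanup branches in `getInfo()` differ only in the variable they touch, so they could be collapsed into one small helper. A minimal refactor sketch, assuming the same 'label/value' page strings; the helper name `strip_label` is my own, not from the original post:

```python
def strip_label(text):
    """Keep only the value part of a 'label/value' string;
    returns '' when there is no '/'-separated value."""
    parts = text.split('/')
    return parts[1] if len(parts) > 1 else ''

# The four if/else blocks in getInfo() would then reduce to:
time = strip_label(time)
day = strip_label(day)
people = strip_label(people)
cost = strip_label(cost)
```

An empty input still maps to `''` (splitting `''` on `'/'` yields a single element), so the behaviour matches the original branches exactly.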
Source: https://www.cnblogs.com/babycomeon/p/12204167.html