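Requirements
Jianshu offers only weak analytics for articles: the list can be sorted by popularity and nothing else, and judging from the page, "popularity" means the number of likes.
[Figure: the popularity sort on Jianshu]
But an article has other dimensions worth analyzing: reads, comments, and likes. Jianshu provides no analysis along these dimensions.
That being the case, let's roll up our sleeves and build it ourselves...
The requirement is simple: scrape the reads, comments, likes, rewards, title, and publish time of my own Jianshu articles, store them in a database, then analyze and display them.
The result looks like this:
[Animation: Jianshu article analysis.gif]
This is only the simplest possible display; other analyses can be added as needed.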
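The fields named in the requirement fit one table row per article per scrape run. The table definition is not shown anywhere in this post, so the following is only a sketch: the table and column names are taken from the INSERT statement in the scraper, while the column types, the id column, and the use of Python's built-in sqlite3 module in place of MySQL are assumptions made so the example runs without a database server.

```python
import sqlite3
import time

# Assumed schema: names follow the scraper's INSERT statement,
# types and the id column are guesses, not the author's definition.
SCHEMA = """
CREATE TABLE analyze_article (
    id           INTEGER PRIMARY KEY AUTOINCREMENT,
    title        TEXT    NOT NULL,
    link         TEXT    NOT NULL,
    create_time  INTEGER NOT NULL,  -- publish time, Unix seconds
    analyze_time INTEGER NOT NULL,  -- scrape time, Unix seconds
    read_num     INTEGER NOT NULL DEFAULT 0,
    like_num     INTEGER NOT NULL DEFAULT 0,
    comment_num  INTEGER NOT NULL DEFAULT 0,
    money_num    INTEGER NOT NULL DEFAULT 0
)
"""

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(SCHEMA)

# A made-up row; the apostrophe in the title shows why placeholders matter.
row = ("It's a test", "https://www.jianshu.com/p/example",
       1512285972, int(time.time()), 10, 2, 1, 0)
conn.execute(
    "INSERT INTO analyze_article "
    "(title, link, create_time, analyze_time, read_num, like_num, comment_num, money_num) "
    "VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    row,
)
conn.commit()

stored = conn.execute("SELECT title, read_num FROM analyze_article").fetchone()
print(stored)
```

Storing analyze_time with every row means repeated scrapes add new rows instead of overwriting old ones, which is what makes analysis over time possible later.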
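Implementation
Scraping the data
The data is scraped with Python. Before scraping, inspect the HTML structure of the page.
[Figure: inspecting the HTML structure]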
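As a rough illustration of what that inspection is looking for, here is a minimal sketch that pulls the relevant attributes out of a small HTML fragment. The fragment is hypothetical, reconstructed from the selectors used by the scraper in this post (li with data-note-id, a.title, span.time with data-shared-at); the real Jianshu markup may differ. It uses the standard library's html.parser instead of pyquery so that it runs without extra packages, and NoteListParser and parse_notes are names invented for this example.

```python
from html.parser import HTMLParser

# Hypothetical markup, NOT copied from Jianshu.
SAMPLE = """
<ul>
  <li data-note-id="1001">
    <a class="title" href="/p/abc123">First post</a>
    <span class="time" data-shared-at="2017-12-03T15:26:12+08:00"></span>
  </li>
  <li data-note-id="1002">
    <a class="title" href="/p/def456">Second post</a>
    <span class="time" data-shared-at="2017-12-04T09:00:00+08:00"></span>
  </li>
</ul>
"""


class NoteListParser(HTMLParser):
    """Collects note id, link, title and publish time for each <li>."""

    def __init__(self):
        super().__init__()
        self.notes = []
        self._current = None    # dict for the <li> being parsed, if any
        self._in_title = False  # True while inside <a class="title">

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "li" and "data-note-id" in attrs:
            self._current = {"note_id": attrs["data-note-id"], "title": ""}
            self.notes.append(self._current)
        elif self._current is not None and tag == "a" and "title" in classes:
            self._current["link"] = attrs.get("href")
            self._in_title = True
        elif self._current is not None and tag == "span" and "time" in classes:
            self._current["shared_at"] = attrs.get("data-shared-at")

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_title = False
        elif tag == "li":
            self._current = None

    def handle_data(self, data):
        if self._in_title and self._current is not None:
            self._current["title"] += data.strip()


def parse_notes(html):
    parser = NoteListParser()
    parser.feed(html)
    return parser.notes


notes = parse_notes(SAMPLE)
for note in notes:
    print(note)
```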
The implementation:
- # -*- coding: utf-8 -*-
- import requests
- import pyquery
- import time
- import datetime
- import pymysql
- # Database connection settings
- conn = pymysql.connect(host='127.0.0.1', user='root', passwd=None, db='test', charset='utf8')
- cur = conn.cursor()
- user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' \
-              '(KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
- # The header value is the bare UA string, without a "user-agent:" prefix
- headers = {"User-Agent": user_agent}
- baseUrl = 'https://www.jianshu.com/u/f9338eda7dda?page='
- page = 0
- flag = True
- while flag:
-     page += 1
-     url = baseUrl + str(page)
-     print(url)
-     # Fetch the page
-     req = requests.get(url, headers=headers, timeout=2)
-     pageText = req.text
-     pq = pyquery.PyQuery(pageText)
-     contents = pq('li')
-     for x in contents:
-         el = pq(x)
-         title = el.find('a.title').text()
-         if title:
-             nodeId = el.attr('data-note-id')
-             # A missing data-note-id means every article has been scraped, so leave the loop
-             if nodeId is None:
-                 flag = False
-                 break
-             link = 'https://www.jianshu.com' + el.find('a.title').attr('href')  # article link
-             postTime = el.find('span.time').attr('data-shared-at')  # publish time
-             # "+08:00" is matched literally; mktime then treats the value as local time,
-             # so this assumes the machine runs in UTC+8
-             dateTime = datetime.datetime.strptime(postTime, "%Y-%m-%dT%H:%M:%S+08:00")
-             create_time = int(time.mktime(dateTime.timetuple()))
-             read_num = el.find('i.ic-list-read').parent().text()  # reads
-             comment_num = el.find('i.ic-list-comments').parent().text()  # comments
-             like_num = el.find('i.ic-list-like').parent().text()  # likes
-             money_num = el.find('i.ic-list-money').parent().text()  # rewards
-             # An article without rewards has no such node, so the text is empty
-             if not money_num:
-                 money_num = 0
-             # Store the row
-             analyze_time = int(time.time())
-             # Parameterized query: pymysql escapes each value, so quotes in a title are safe
-             sql = "insert into analyze_article " \
-                   "(title, link, create_time, analyze_time, read_num, like_num, comment_num, money_num) values " \
-                   "(%s, %s, %s, %s, %s, %s, %s, %s)"
-             cur.execute(sql, (title, link, create_time, analyze_time, read_num, like_num, comment_num, money_num))
-             conn.commit()
-     # Pause for 1 second to avoid being blocked by Jianshu's anti-scraping measures
-     time.sleep(1)
- cur.close()
- conn.close()
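Two values in the script depend on the shape of the scraped text: the publish time and the counters. One possible way to normalize them before the insert is sketched below. The helper names shared_at_to_epoch and count_to_int are hypothetical, and the sketch assumes Python 3.7 or later, where %z accepts an offset written as +08:00. Unlike time.mktime, the result does not depend on the time zone of the machine running the scraper.

```python
from datetime import datetime


def shared_at_to_epoch(value):
    """Convert an ISO 8601 string with a UTC offset to Unix seconds."""
    # %z parses the offset, so the datetime is timezone-aware.
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S%z")
    return int(parsed.timestamp())


def count_to_int(text):
    """Turn the text next to a list icon into an int; empty text means 0."""
    text = (text or "").strip()
    return int(text) if text.isdigit() else 0


# The same instant written with two different offsets gives the same epoch.
epoch = shared_at_to_epoch("2017-12-03T15:26:12+08:00")
same_instant = shared_at_to_epoch("2017-12-03T07:26:12+00:00")
print(epoch == same_instant)

print(count_to_int(" 42 "), count_to_int(""), count_to_int(None))
```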
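Reading the data with PHP
Once the scraper has stored the data, PHP serves as the backend that reads it from the table.
The read script is simple enough to need no explanation, so here is the code: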
- <?php
- // Required when the client and server are on different origins; otherwise the browser reports a cross-origin error
- header("Access-Control-Allow-Origin:*");
- $con = mysqli_connect("localhost", "root", "", "test");
- // Midnight today minus one day: only rows analyzed since then are returned
- $analyzeTime = strtotime(date('Y-m-d', time())) - 3600 * 24;
- $sql = "SELECT * FROM analyze_article where analyze_time >= $analyzeTime";
- // Each clause starts with a space so it can be appended to $sql directly
- $order = '';
- if (isset($_GET['read_num'])) {
-     $order = " order by read_num desc";
- }
- if (isset($_GET['like_num'])) {
-     $order = " order by like_num desc";
- }
- if (isset($_GET['comment_num'])) {
-     $order = " order by comment_num desc";
- }
- if (isset($_GET['money_num'])) {
-     $order = " order by money_num desc";
- }
- $sql .= $order;
- $result = mysqli_query($con, $sql);
- $data = mysqli_fetch_all($result, MYSQLI_ASSOC);
- mysqli_free_result($result);
- mysqli_close($con);
- // The second argument of json_encode is a flags bitmask, not a boolean
- echo json_encode($data, JSON_UNESCAPED_UNICODE);
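For readers who would rather follow the query-building logic in Python, the same idea can be sketched as below. This is not part of the original project: build_query and ALLOWED_ORDER are hypothetical names, and the sketch only builds the SQL string and its parameter without connecting to a database. The ORDER BY column is picked from a fixed list, so nothing from the request ever ends up in the SQL text.

```python
from datetime import datetime

# Columns the client is allowed to sort by, in the order the PHP script checks them.
ALLOWED_ORDER = ("read_num", "like_num", "comment_num", "money_num")


def build_query(params, now=None):
    """Return (sql, args) for rows analyzed since yesterday's midnight."""
    now = now or datetime.now()
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    cutoff = int(midnight.timestamp()) - 3600 * 24

    sql = "SELECT * FROM analyze_article WHERE analyze_time >= %s"

    # Later columns win when several are present, like the chain of ifs in PHP.
    order = None
    for column in ALLOWED_ORDER:
        if column in params:
            order = column
    if order:
        sql += " ORDER BY " + order + " DESC"
    return sql, (cutoff,)


sql, args = build_query({"like_num": "1"}, now=datetime(2017, 12, 3, 15, 0, 0))
print(sql)
print(args)
```

Request keys that are not in ALLOWED_ORDER are simply ignored, so an unexpected parameter falls back to the unsorted query.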
Displaying the data with vue.js
The PHP backend returns JSON, and vue.js parses it and renders it on the page.
- <!doctype html>
- <html lang="en">
- <head>
-     <meta charset="UTF-8">
-     <link href="https://cdn.bootcss.com/bootstrap/3.3.7/css/bootstrap.min.css" rel="stylesheet">
-     <!-- Pinned to Vue 2, because the page uses the new Vue({...}) API -->
-     <script src="https://cdn.jsdelivr.net/npm/vue@2/dist/vue.js"></script>
-     <script src="https://unpkg.com/axios/dist/axios.min.js"></script>
-     <title>Jianshu Article Analysis</title>
-     <style>
-         .container {
-             margin-top: 2%;
-         }
-     </style>
- </head>
- <body>
- <div class="container">
-     <div id="app">
-         <h3 class="text-center">Jianshu Article Analysis</h3>
-         <table class="table table-bordered table-hover">
-             <tr>
-                 <th>Title</th>
-                 <th><a href="" @click.prevent="changeOrder('read_num')" class="text-info">Reads</a></th>
-                 <th><a href="" @click.prevent="changeOrder('like_num')" class="text-danger">Likes</a></th>
-                 <th><a href="" @click.prevent="changeOrder('comment_num')" class="text-warning">Comments</a></th>
-                 <th><a href="" @click.prevent="changeOrder('money_num')" class="text-success">Rewards</a></th>
-             </tr>
-             <tr v-for="item in list">
-                 <td><a :href="item.link" target="_blank">{{ item.title }}</a></td>
-                 <td>{{ item.read_num }}</td>
-                 <td>{{ item.like_num }}</td>
-                 <td>{{ item.comment_num }}</td>
-                 <td>{{ item.money_num }}</td>
-             </tr>
-         </table>
-     </div>
- </div>
- <script>
-     let url = 'http://local.php.com/jianshu.php';
-     let vm = new Vue({
-         el: '#app',
-         data: {
-             list: []
-         },
-         methods: {
-             // Reload the list sorted by the given column, e.g. ?read_num=1
-             changeOrder: function (sign) {
-                 let reqUrl = url + '?' + sign + '=1';
-                 axios.get(reqUrl, {})
-                     .then(function (response) {
-                         vm.$data.list = response.data;
-                     })
-                     .catch(function (error) {
-                         console.log(error);
-                     });
-             },
-         }
-     });
-     // Initial load, in the server's default order
-     axios.get(url, {})
-         .then(function (response) {
-             vm.$data.list = response.data;
-         })
-         .catch(function (error) {
-             console.log(error);
-         });
- </script>
- </body>
- </html>
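Readers unfamiliar with vue.js may want to look at the series index "Learning vue.js by Example".
Summary
Beyond the very simple sorting by different dimensions shown above, the data can be analyzed from other angles, provided you have enough of it. You can also point the program at the Jianshu home pages of popular authors, which helps you see what makes their articles work.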
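As one example of a different angle, likes can be related to reads instead of being ranked on their own. A minimal sketch, assuming rows shaped like the JSON objects returned by the PHP script (the counters may arrive as strings); like_rate is a hypothetical helper, and the sample rows are made up for illustration.

```python
def like_rate(rows):
    """Return (title, likes per 100 reads) pairs, highest rate first."""
    rated = []
    for row in rows:
        reads = int(row["read_num"])
        likes = int(row["like_num"])
        # Articles with no reads get a rate of 0 instead of dividing by zero.
        rate = 100.0 * likes / reads if reads else 0.0
        rated.append((row["title"], round(rate, 2)))
    return sorted(rated, key=lambda pair: pair[1], reverse=True)


# Made-up sample data, not real article statistics.
rows = [
    {"title": "A", "read_num": "200", "like_num": "10"},
    {"title": "B", "read_num": "50", "like_num": "5"},
    {"title": "C", "read_num": "0", "like_num": "0"},
]

ranking = like_rate(rows)
for title, rate in ranking:
    print(title, rate)
```

Here article B ranks above A even though A has more likes in absolute terms, which is the kind of difference a plain sort by likes hides.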
Full source download: https://gitee.com/zhiqiexing/program/tree/8ac3f46e2e51ead0d8f24f73cf27bc9a3ca79a52/analyze_jianshu
Source: http://www.jianshu.com/p/e837dfd7f4b7