Product managers and programmers can seem like natural adversaries in the internet industry, yet they sit side by side on the product development chain, and a project only moves forward when the two work closely together. So what do product managers actually read every day, and how can we programmers cater to their tastes? I scraped every article in the product manager column of Woshipm (人人都是产品经理, "Everyone Is a Product Manager") to find out what product managers like to read.
1. Background
1.1. Why Woshipm (人人都是产品经理)?
Woshipm is a learning, exchange, and sharing platform centered on product managers and operations staff, combining media, training, recruitment, and community to serve product and operations people across the board. In the eight years since its founding it has run 500+ online lectures, 300+ offline meetups, and 20+ product manager and operations conferences, covering 15 cities including Beijing, Shanghai, Guangzhou, Shenzhen, Hangzhou, and Chengdu, and it carries considerable influence and name recognition in the industry. The platform gathers many product and operations directors from well-known internet companies such as BAT, Meituan, JD, Didi, 360, Xiaomi, and NetEase, which makes it a representative community to sample.
1.2. Analysis Tools
- Python 3.6
- Matplotlib
- WordCloud
- Jieba
```python
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
import csv

import requests
from bs4 import BeautifulSoup

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
    'Cache-Control': 'max-age=0',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36',
    'Connection': 'keep-alive',
    'Host': 'www.woshipm.com',
    'Cookie': 't=MHpOYzlnMmp6dkFJTEVmS3pDeldrSWRTazlBOXpkRjBzRXpZOU4yVkNZWWl5QVhMVXBjMU5WcnpwQ2NCQS90ZkVsZ3lTU2Z0T3puVVZFWFRFOXR1TnVrbUV2UFlsQWxuemY4NG1wWFRYMENVdDRPQ1psK0NFZGJDZ0lsN3BQZmo=; s=Njg4NDkxLCwxNTQyMTk0MTEzMDI5LCxodHRwczovL3N0YXRpYy53b3NoaXBtLmNvbS9XWF9VXzIwMTgwNV8yMDE4MDUyMjE2MTcxN180OTQ0LmpwZz9pbWFnZVZpZXcyLzIvdy84MCwsJUU1JUE0JUE3JUU4JTk5JUJF; Hm_lvt_b85cbcc76e92e3fd79be8f2fed0f504f=1547467553,1547544101,1547874937,1547952696; Hm_lpvt_b85cbcc76e92e3fd79be8f2fed0f504f=1547953708'
}

with open('data.csv', 'w', encoding='utf-8', newline='') as csvfile:
    fieldnames = ['title', 'author', 'author_des', 'date', 'views', 'loves',
                  'zans', 'comment_num', 'art', 'url']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for page_number in range(1, 549):
        page_url = "http://www.woshipm.com/category/pmd/page/{}".format(page_number)
        print('Fetching listing page ' + str(page_number) + '>>>')
        response = requests.get(url=page_url, headers=headers)
        if response.status_code == 200:
            page_data = response.text
            if page_data:
                soup = BeautifulSoup(page_data, 'lxml')
                article_urls = soup.find_all("h2", class_="post-title")
                for item in article_urls:
                    url = item.find('a').get('href')
                    # Parse the article page: title, author, author bio, date,
                    # views, bookmarks, likes, comment count, body, URL
                    response = requests.get(url=url, headers=headers)
                    # time.sleep(3)
                    print('Fetching: ' + url)
                    # print(response.status_code)
                    if response.status_code == 200:
                        article = response.text
                        # print(article)
                        if article:
                            try:
                                soup = BeautifulSoup(article, 'lxml')
                                # Article title
                                title = soup.find(class_='article-title').get_text().strip()
                                # Author
                                author = soup.find(class_='post-meta-items').find_previous_siblings()[1].find('a').get_text().strip()
                                # Author bio
                                author_des = soup.find(class_='post-meta-items').find_previous_siblings()[0].get_text().strip()
                                # Date
                                date = soup.find(class_='post-meta-items').find_all(class_='post-meta-item')[0].get_text().strip()
                                # View count
                                views = soup.find(class_='post-meta-items').find_all(class_='post-meta-item')[1].get_text().strip()
                                # Bookmark count
                                loves = soup.find(class_='post-meta-items').find_all(class_='post-meta-item')[2].get_text().strip()
                                # Like count
                                zans = soup.find(class_='post-meta-items').find_all(class_='post-meta-item')[3].get_text().strip()
                                # Comment count
                                comment = soup.find('ol', class_="comment-list").find_all('li')
                                comment_num = len(comment)
                                # Body text
                                art = soup.find(class_="grap").get_text().strip()
                                writer.writerow({'title': title, 'author': author, 'author_des': author_des,
                                                 'date': date, 'views': views, 'loves': int(loves),
                                                 'zans': int(zans), 'comment_num': int(comment_num),
                                                 'art': art, 'url': url})
                                print({'title': title, 'author': author, 'author_des': author_des,
                                       'date': date, 'views': views, 'loves': loves,
                                       'zans': zans, 'comment_num': comment_num})
                            except Exception:
                                print('Fetch failed')
print("Done!")
```
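The listing-page step above pulls article links out of `h2.post-title` elements. That parsing logic can be sanity-checked offline against a static HTML fragment; the markup below is a hypothetical stand-in for a real listing page, not the site's actual HTML:

```python
from bs4 import BeautifulSoup

# Minimal stand-in for one listing page (assumed structure)
html = """
<h2 class="post-title"><a href="http://www.woshipm.com/pmd/111.html">Post A</a></h2>
<h2 class="post-title"><a href="http://www.woshipm.com/pmd/222.html">Post B</a></h2>
"""

soup = BeautifulSoup(html, 'html.parser')
# Same selector the scraper uses: every h2 with class post-title
urls = [h2.find('a').get('href') for h2 in soup.find_all('h2', class_='post-title')]
print(urls)
# ['http://www.woshipm.com/pmd/111.html', 'http://www.woshipm.com/pmd/222.html']
```

Testing the selector against canned HTML like this makes it much easier to debug than re-fetching live pages on every change.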
```python
import pandas as pd

# Load the CSV into a DataFrame
csv_file = "data.csv"
csv_data = pd.read_csv(csv_file, low_memory=False)  # suppress mixed-dtype warnings
csv_df = pd.DataFrame(csv_data)
print(csv_df)
```
```python
print(csv_df.shape)   # number of rows and columns
print(csv_df.info())  # overall summary
print(csv_df.head())  # first 5 rows
```

Output:

```
(6574, 10)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6574 entries, 0 to 6573
Data columns (total 10 columns):
title          6574 non-null object
author         6574 non-null object
author_des     6135 non-null object
date           6574 non-null object
views          6574 non-null object
loves          6574 non-null int64
zans           6574 non-null int64
comment_num    6574 non-null int64
art            6574 non-null object
url            6574 non-null object
dtypes: int64(3), object(7)
memory usage: 513.7+ KB
None
                 title ...                url
...
[5 rows x 10 columns]
```
```python
# Convert the date column to datetime format
csv_df['date'] = pd.to_datetime(csv_df['date'])
```
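`pd.to_datetime` infers the format from the strings themselves, so the whole column can be converted in one call. A quick illustration on made-up dates:

```python
import pandas as pd

df = pd.DataFrame({'date': ['2015-01-29', '2019-01-21']})
df['date'] = pd.to_datetime(df['date'])  # object -> datetime64[ns]
print(df['date'].dt.year.tolist())  # [2015, 2019]
```

Once the column is datetime-typed, the `.dt` accessor (used later to extract the year) becomes available.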
```python
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
import os
import re
from decimal import Decimal
from os import path

import jieba
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

# Normalize the views column: e.g. "2.3万" -> 23000
def views_to_num(item):
    m = re.search('.?(万)', item['views'])
    if m:
        ns = item['views'][:-1]
        nss = Decimal(ns) * 10000
    else:
        nss = item['views']
    return int(nss)

# Data cleaning
def parse_woshipm():
    # Load the CSV into a DataFrame
    csv_file = "data.csv"
    csv_data = pd.read_csv(csv_file, low_memory=False)  # suppress mixed-dtype warnings
    csv_df = pd.DataFrame(csv_data)
    # print(csv_df.shape)   # rows and columns
    # print(csv_df.info())  # overall summary
    # print(csv_df.head())  # first 5 rows
    # Convert the date column to datetime format
    csv_df['date'] = pd.to_datetime(csv_df['date'])
    # Turn the views strings into numbers in a new views_num column
    csv_df['views_num'] = csv_df.apply(views_to_num, axis=1)
    print(csv_df.info())

if __name__ == '__main__':
    parse_woshipm()
```
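The 万 (10,000) conversion in `views_to_num` can be exercised on its own. Here is a standalone variant of the same logic that takes a plain string rather than a DataFrame row:

```python
import re
from decimal import Decimal

def views_to_num(views):
    # "2.3万" means 2.3 * 10,000; plain numeric strings pass through unchanged
    if re.search('万', views):
        return int(Decimal(views[:-1]) * 10000)
    return int(views)

print(views_to_num('2.3万'))  # 23000
print(views_to_num('856'))    # 856
```

`Decimal` is used instead of `float` so that values like "2.3万" convert exactly, without binary floating-point rounding.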
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6574 entries, 0 to 6573
Data columns (total 11 columns):
title          6574 non-null object
author         6574 non-null object
author_des     6135 non-null object
date           6574 non-null datetime64[ns]
views          6574 non-null object
loves          6574 non-null int64
zans           6574 non-null int64
comment_num    6574 non-null int64
art            6574 non-null object
url            6574 non-null object
views_num      6574 non-null int64
dtypes: datetime64[ns](1), int64(4), object(6)
memory usage: 565.0+ KB
None
```
```python
# Check whether any rows are exact duplicates; True means duplicates exist
# print(any(csv_df.duplicated()))
# It prints True, so count how many duplicates there are
data_duplicated = csv_df.duplicated().value_counts()
# print(data_duplicated)
# Output:
# False    6562
# True       12
# dtype: int64
# Drop the duplicate rows
data = csv_df.drop_duplicates(keep='first')
# Dropping rows leaves gaps in the index, so reset it
data = data.reset_index(drop=True)
```
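The duplicate-handling pattern above (`duplicated` to count, then `drop_duplicates` plus `reset_index` to clean) behaves like this on a small made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'title': ['a', 'b', 'a'], 'views': [10, 20, 10]})
print(df.duplicated().sum())  # 1 (row 2 repeats row 0)
clean = df.drop_duplicates(keep='first').reset_index(drop=True)
print(len(clean))             # 2
print(clean.index.tolist())   # [0, 1] -- contiguous again after reset_index
```

Without `reset_index(drop=True)` the surviving rows would keep their old labels (0 and 1 here, but with gaps in general), which can cause surprises in later positional operations.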
```python
# Add a title-length column and a year column
data['title_length'] = data['title'].apply(len)
data['year'] = data['date'].dt.year
```
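On a toy frame, the two derived columns look like this (the titles are invented):

```python
import pandas as pd

data = pd.DataFrame({'title': ['产品思维', 'Roadmap 101'],
                     'date': pd.to_datetime(['2018-05-01', '2019-01-21'])})
data['title_length'] = data['title'].apply(len)  # character count of each title
data['year'] = data['date'].dt.year              # extract year for grouping
print(data['title_length'].tolist())  # [4, 11]
print(data['year'].tolist())          # [2018, 2019]
```

Note that `len` counts characters, so a four-character Chinese title scores 4, the same as a four-letter English one.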
```python
print(data['author'].describe())
print(data['date'].describe())
```

Output:

```
count                    6562
unique                   1531
top                     Nairo
freq                      315
Name: author, dtype: object
count                    6562
unique                   1827
top       2015-01-29 00:00:00
freq                       16
first     2012-11-25 00:00:00
last      2019-01-21 00:00:00
Name: date, dtype: object
```
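For an object column, `describe()` reports the count, the number of unique values, the most frequent value (`top`), and its frequency (`freq`), which is how the Nairo figure above is read. A tiny sketch with invented authors:

```python
import pandas as pd

authors = pd.Series(['Nairo', 'Nairo', 'Nairo', 'Alice', 'Bob'])
desc = authors.describe()
print(desc['top'], desc['freq'])  # most frequent author and how often they appear
print(desc['unique'])             # number of distinct authors
```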
Source: http://www.jianshu.com/p/27a1e0ca2b18